This project investigates spatial patterns and brand-related trends in Amazon search result data. The dataset, collected through browser extensions, includes the position (top/left coordinates), brand affiliation, and promotional status (e.g., Amazon Prime, Sponsored) of products shown in user searches.
The notebook explores whether Amazon-branded products receive more favorable placement than third-party products. I perform exploratory data analysis (EDA), calculate frequency metrics, and visualize spatial trends to assess potential bias in search result rankings.
This notebook focuses on data cleaning, visual pattern discovery, and early insight generation.
# Import libraries for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.cluster import KMeans
# Data Import
# Load the Amazon user dataset and inspect the first few rows
data = pd.read_csv('C:/Users/Baljot/Desktop/Old School/Data 301/A3/Amazon_data.csv')
data.head()
| | Unnamed: 0 | base_spell | subspell | date_created_day | Top | Left | asin | is_targeted_brand | search_result_amazonprime | search_result_usedoptions | ... | used_offers | subsave_option | name_fit | name_cosine_distance | nonamefit | nonamedistance | brand | major_brand | has_amazon_brands | has_other_brands |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 8/31/2022 | 226.00000 | 283.20001 | B098QR8Q4N | False | False | False | ... | False | False | 0.400000 | 0.186015 | False | NaN | Warriors: | False | 0 | 0 |
1 | 2 | 1 | 1 | 8/31/2022 | 461.05002 | 283.20001 | B000VYX8L8 | False | False | False | ... | False | False | 0.293103 | 0.119305 | False | NaN | Warriors | False | 0 | 0 |
2 | 3 | 1 | 1 | 8/31/2022 | 722.65002 | 283.20001 | B0B6Y7SNT9 | False | False | False | ... | False | False | 0.333333 | 0.303127 | False | NaN | Warriors: | False | 0 | 0 |
3 | 4 | 1 | 1 | 8/31/2022 | 1341.45010 | 283.20001 | B09RKBVZVJ | False | False | False | ... | False | False | 0.363636 | 0.310618 | False | NaN | Warriors: | False | 0 | 0 |
4 | 5 | 1 | 1 | 8/31/2022 | 1571.65000 | 283.20001 | B09N8T1DWR | False | False | False | ... | False | False | 0.363636 | 0.170995 | False | NaN | Warriors | False | 0 | 0 |
5 rows × 68 columns
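The `Unnamed: 0` column in the preview usually means a row index was written into the CSV when it was saved. A minimal sketch, assuming that is what happened here, of how `index_col=0` keeps such a column out of the data; the inline CSV text is a made-up stand-in for Amazon_data.csv and only a few of its columns are shown.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: the first, unnamed column is a saved row index.
csv_text = (
    ",base_spell,Top,Left,asin\n"
    "1,1,226.0,283.2,B098QR8Q4N\n"
    "2,1,461.05,283.2,B000VYX8L8\n"
)

# Default read: pandas labels the unnamed first column "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the same column becomes the DataFrame index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

Whether to do this on the real file depends on whether `Unnamed: 0` carries any meaning beyond a row counter, which the preview alone does not settle.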
Several new columns were created to support the analysis:
date_created_day: Converted the original date strings into datetime format to enable time-based grouping and visualization.
amazon_brand_count: A binary (0/1) indicator of whether a product is affiliated with an Amazon brand, so that daily sums give counts.
amazon_rank_value: Extracted the rank values for Amazon-branded products to calculate their average search ranking over time.
targeted_brand_count: Created a count for targeted brands appearing in search results to track their visibility trends.
major_brand_count: A binary (0/1) marker identifying whether a product belongs to a major brand.
These features were used throughout the exploratory analysis to better understand brand presence, ranking behaviors, and potential spatial biases in search result placements.
data.columns
Index(['Unnamed: 0', 'base_spell', 'subspell', 'date_created_day', 'Top', 'Left', 'asin', 'is_targeted_brand', 'search_result_amazonprime', 'search_result_usedoptions', 'search_result_outofstock', 'search_result_best_seller', 'search_result_sponsored', 'search_result_sponsored_tag', 'search_result_rank', 'search_result_resultdetail', 'search_result_stars', 'search_result_ratings', 'search_result_newprice', 'search_result_oldprice', 'search_result_unitprice', 'search_result_deliveryrule', 'search_result_deliverytime', 'search_result_price', 'search_result_stockleft', 'search_result_coupon', 'search_result_freedelivery', 'search_result_used_price', 'search_result_used_offers', 'search_result_discount_subsave', 'search_result_freeshipping', 'search_result_brand_subtitle', 'rank_full', 'rank_data_index', 'rank_unique_page', 'amazon_brand', 'search_results_stars', 'ratings', 'norating', 'stars', 'nostars', 'price', 'price_discount', 'noprice', 'delivery_speed', 'min_for_freedelivery', 'delivery_fee', 'nodeliveryfee', 'free_delivery', 'free_delivery_possible', 'delivery_date', 'delivery_days', 'nodeliverydt', 'noinfostockleft', 'coupon', 'n_other_offers', 'no_n_other_offers', 'new_offers', 'used_offers', 'subsave_option', 'name_fit', 'name_cosine_distance', 'nonamefit', 'nonamedistance', 'brand', 'major_brand', 'has_amazon_brands', 'has_other_brands', 'amazon_brand_count', 'major_brand_count', 'total_brand_count'], dtype='object')
# Share of rows flagged is_targeted_brand, which this notebook reports as Amazon-branded
targeted_pct = np.mean(data['is_targeted_brand'] == True) * 100
print(f"Amazon-branded products make up {targeted_pct:.2f}% of all results.")
Amazon-branded products make up 1.26% of all results.
# Split the data: targeted-brand rows vs. major-brand rows that are not targeted
targeted_brand = data[data['is_targeted_brand'] == True]
not_targeted_brand = data[(data['is_targeted_brand'] == False) & (data['major_brand'] == True)]
# Scatter plot of on-page position, colored by brand group
plt.figure(figsize=(10, 6))
plt.scatter(targeted_brand['Left'], targeted_brand['Top'], marker='o', color='#00A8E1', label='Amazon Targeted Brand', alpha=0.5)
plt.scatter(not_targeted_brand['Left'], not_targeted_brand['Top'], marker='o', color='red', label='Major Brand (Not Targeted)', alpha=0.5)
plt.xlabel('Left (X-coordinate)')
plt.ylabel('Top (Y-coordinate)')
plt.title('Approximate positioning of Brands')
plt.legend()
plt.grid(True)
plt.show()
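Overlapping points make a scatter plot hard to read, so a numeric summary of the same split can help. The sketch below compares the median Top value of the two groups. The coordinates are made up for illustration, and reading a smaller Top as "higher on the page" assumes Top is measured downward from the top of the page, which the dataset description suggests but does not state outright.

```python
import pandas as pd

# Made-up coordinates for illustration only; the real values live in `data`.
sample = pd.DataFrame({
    "is_targeted_brand": [True, True, False, False, False],
    "major_brand":       [False, False, True, True, False],
    "Top":               [200.0, 400.0, 600.0, 1000.0, 300.0],
})

# Same split as the scatter plot: targeted rows vs. non-targeted major-brand rows.
targeted = sample[sample["is_targeted_brand"]]
major_not_targeted = sample[~sample["is_targeted_brand"] & sample["major_brand"]]

# Median vertical offset per group (assumption: smaller Top = higher on the page).
summary = pd.Series({
    "targeted": targeted["Top"].median(),
    "major_not_targeted": major_not_targeted["Top"].median(),
})
print(summary)
```

The median is used rather than the mean so that a few listings far down a long page do not dominate the comparison.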
Converted the date_created_day column to datetime format for proper time-based analysis and plotting.
data['date_created_day'] = pd.to_datetime(data['date_created_day'])
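The preview shows dates written as month/day/year (for example 8/31/2022). A minimal sketch, assuming every row follows that pattern, of passing an explicit format so the parse cannot silently swap day and month, and of counting values that fail to parse. The sample strings are illustrative, not taken from the full file.

```python
import pandas as pd

# Sample strings in the month/day/year style from the preview, plus one bad value.
raw_dates = pd.Series(["8/31/2022", "6/17/2022", "not a date"])

# An explicit format removes day/month ambiguity; errors="coerce" turns
# unparseable entries into NaT instead of raising an exception.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```

If the count of unparseable rows is nonzero on the real column, those rows would drop out of any later groupby on the date, so it is worth checking before plotting.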
To better understand the scope of the data, I extracted the minimum and maximum dates from the date_created_day column.
This shows the period over which the data was collected.
min_date = data['date_created_day'].min()
max_date = data['date_created_day'].max()
print(min_date, max_date)
2022-06-17 00:00:00 2023-01-09 00:00:00
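The two printed dates can be turned into a span length directly. A small sketch using the minimum and maximum dates shown in the output above:

```python
import pandas as pd

# The two boundary dates printed by the notebook.
min_date = pd.Timestamp("2022-06-17")
max_date = pd.Timestamp("2023-01-09")

# Subtracting Timestamps gives a Timedelta; .days is the whole-day span.
span_days = (max_date - min_date).days
print(f"Collection window: {span_days} days")
```

On the real data, `data['date_created_day'].nunique()` would additionally show on how many distinct days within that window results were actually recorded.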
Behind every product search lies a quiet competition. This chart shows the daily proportion of search results that are Amazon-branded and the daily proportion that are major-branded: a tug-of-war between Amazon's own offerings and major household names.
Each ratio is the number of branded products on a given day divided by the total number of listings recorded that day. The plot shows how the visibility of each group rises and falls over the collection period.
# Convert boolean values to int
data['amazon_brand_count'] = data['amazon_brand'].astype(int)
data['major_brand_count'] = data['major_brand'].astype(int)
data['total_brand_count'] = 1 # Every row = 1 product
# Group data by date
grouped = data.groupby('date_created_day').agg({'amazon_brand_count': 'sum', 'total_brand_count': 'sum'})
major_grouped = data.groupby('date_created_day').agg({'major_brand_count': 'sum', 'total_brand_count': 'sum'})
# Calculate daily ratios
grouped['amazon_brand_ratio'] = grouped['amazon_brand_count'] / grouped['total_brand_count']
major_grouped['major_brand_ratio'] = major_grouped['major_brand_count'] / major_grouped['total_brand_count']
# Plot
plt.figure(figsize=(12, 6))
plt.plot(grouped.index, grouped['amazon_brand_ratio'], color='#00A8E1', marker='o', label='Amazon Brand Ratio')
plt.plot(major_grouped.index, major_grouped['major_brand_ratio'], color='red', marker='o', label='Major Brand Ratio')
plt.xlabel('Date')
plt.ylabel('Brand Ratio')
plt.title('Amazon vs Major Brand Ratios Over Time')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
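The daily ratio described above (branded listings divided by all listings that day) is the same thing as the per-day mean of a boolean flag. A minimal sketch on made-up rows that computes both ratios in a single groupby; the column names match the notebook, but the values are invented for illustration.

```python
import pandas as pd

# Made-up listings for illustration: four on the first day, two on the second.
toy = pd.DataFrame({
    "date_created_day": pd.to_datetime(
        ["2022-08-30", "2022-08-30", "2022-08-30", "2022-08-30",
         "2022-08-31", "2022-08-31"]
    ),
    "amazon_brand": [True, False, False, False, True, False],
    "major_brand":  [False, True, True, False, False, False],
})

# The mean of a boolean column is the share of True values, so a per-day
# mean equals (branded listings that day) / (all listings that day).
daily_ratios = toy.groupby("date_created_day")[["amazon_brand", "major_brand"]].mean()
print(daily_ratios)
```

This gives the same numbers as summing 0/1 indicator columns and dividing by a row count, without needing the helper columns.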
This line chart compares the daily average search result rank of Amazon-branded and major-branded products over time. In online retail, rank determines visibility, and visibility drives outcomes. Lower values indicate higher placement on the page. Dashed horizontal lines show the mean of the daily averages for each group, making any persistent gap in positioning easy to see. Even small differences in search ranking can shape which brands capture the customer's attention.
# Calculate the daily average rank (rank_full) for Amazon brands and major brands
amazon_average_ranks = data[data['amazon_brand']].groupby('date_created_day')['rank_full'].mean()
major_average_ranks = data[data['major_brand']].groupby('date_created_day')['rank_full'].mean()
# Plot the daily average rank of both groups over time
plt.figure(figsize=(12, 6))
plt.plot(amazon_average_ranks.index, amazon_average_ranks.values, color='#00A8E1', marker='o', label='Amazon Brands')
plt.plot(major_average_ranks.index, major_average_ranks.values, color='red', marker='o', label='Major Brands')
# Add a dashed line at the mean of the daily averages for each group
plt.axhline(major_average_ranks.mean(), color='red', linestyle='--', label=f'Avg Major Rank: {major_average_ranks.mean():.2f}')
plt.axhline(amazon_average_ranks.mean(), color='#00A8E1', linestyle='--', label=f'Avg Amazon Rank: {amazon_average_ranks.mean():.2f}')
plt.xlabel('Date')
plt.ylabel('Brand Rank Average')
plt.title('Brand Rank Average Over Time')
plt.grid(True)
plt.legend()
plt.show()
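One detail about the dashed lines: taking `.mean()` of a series of daily means weights every day equally, however many listings that day has. That can differ from the mean over all rows. A small sketch on invented ranks to show the gap between the two:

```python
import pandas as pd

# Invented ranks: three listings on the first day, one on the second.
toy = pd.DataFrame({
    "date_created_day": pd.to_datetime(
        ["2022-08-30", "2022-08-30", "2022-08-30", "2022-08-31"]
    ),
    "rank_full": [2, 4, 6, 20],
})

# Daily means are 4.0 and 20.0.
daily_mean = toy.groupby("date_created_day")["rank_full"].mean()

# Each day counts once, regardless of how many listings it has.
mean_of_daily_means = daily_mean.mean()

# Each listing counts once.
pooled_mean = toy["rank_full"].mean()

print(mean_of_daily_means, pooled_mean)
```

Neither number is wrong; they answer different questions (a typical day versus a typical listing), so the legend label should say which one is drawn.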
This chart shows the daily average search result position of Amazon and major-branded products using rank_data_index, which records placement per individual user search. rank_data_index was selected over rank_full due to its clearer construction and direct interpretability. Lower values correspond to better visibility in search results.
# Calculate the average rank by brand type for each time period
amazon_average_ranks = data[data['amazon_brand']].groupby('date_created_day')['rank_data_index'].mean()
major_average_ranks = data[data['major_brand']].groupby('date_created_day')['rank_data_index'].mean()
# Plot the daily average result position of both groups over time
plt.figure(figsize=(12, 6))
plt.plot(amazon_average_ranks.index, amazon_average_ranks.values, color='#00A8E1', marker='o', label='Amazon Brands')
plt.plot(major_average_ranks.index, major_average_ranks.values, color='red', marker='o', label='Major Brands')
# Add a dashed line at the mean of the daily averages for each group
plt.axhline(amazon_average_ranks.mean(), color='#00A8E1', linestyle='--', label=f'Avg Amazon Position: {amazon_average_ranks.mean():.2f}')
plt.axhline(major_average_ranks.mean(), color='red', linestyle='--', label=f'Avg Major Position: {major_average_ranks.mean():.2f}')
plt.xlabel('Date')
plt.ylabel('Brand Result Position Average')
plt.title('Brand Result Position Average Over Time')
plt.grid(True)
plt.legend()
plt.show()
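The rank_full and rank_data_index charts apply the same filter-then-groupby step to different columns. A possible way to avoid repeating it is a small helper; `daily_mean_rank` is a hypothetical name introduced here, not something defined in the notebook, and the toy values are invented.

```python
import pandas as pd

def daily_mean_rank(df, flag_col, rank_col, date_col="date_created_day"):
    """Per-day mean of rank_col over the rows where the boolean flag_col is True."""
    return df.loc[df[flag_col]].groupby(date_col)[rank_col].mean()

# Invented rows carrying both rank columns.
toy = pd.DataFrame({
    "date_created_day": pd.to_datetime(
        ["2022-08-30", "2022-08-30", "2022-08-30", "2022-08-31"]
    ),
    "amazon_brand":    [True, True, False, True],
    "rank_full":       [1, 3, 5, 7],
    "rank_data_index": [2, 4, 6, 10],
})

# The same helper serves both rank definitions.
by_full = daily_mean_rank(toy, "amazon_brand", "rank_full")
by_index = daily_mean_rank(toy, "amazon_brand", "rank_data_index")
print(by_full)
print(by_index)
```

Putting the two resulting series side by side also makes it easy to check whether the choice of rank column changes the picture.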
This chart traces the daily count of Amazon-branded and major-branded products appearing across user searches. Grouped by date, the counts reveal how often each brand type surfaced over time. Dashed horizontal lines mark the average daily prevalence for Amazon and major brands, offering a steady benchmark against the day-to-day fluctuations. Together, these trends highlight how brand presence shifts within the search landscape.
# Filters the DataFrame to select Amazon's brands and major brands
amazon_brands_df = data[data['amazon_brand']]
major_brands_df = data[data['major_brand']]
# Groups by the relevant time period (e.g., date) and counts the occurrences of Amazon's brands and major brands
amazon_brand_counts = amazon_brands_df.groupby('date_created_day').size()
major_brand_counts = major_brands_df.groupby('date_created_day').size()
# Visualize the Brand Prevalence over time
plt.figure(figsize=(12, 6))
plt.plot(amazon_brand_counts.index, amazon_brand_counts.values, label='Amazon Brands', color='#00A8E1', marker='o')
plt.plot(major_brand_counts.index, major_brand_counts.values, label='Major Brands', color='red', marker='o')
plt.axhline(amazon_brand_counts.mean(), color='#00A8E1', linestyle='--', label=f'Avg Amazon Prevalence: {amazon_brand_counts.mean():.2f}')
plt.axhline(major_brand_counts.mean(), color='red', linestyle='--', label=f'Avg Major Prevalence: {major_brand_counts.mean():.2f}')
plt.xlabel('Date')
plt.ylabel('Brand Prevalence')
plt.title('Brand Prevalence Over Time')
plt.grid(True)
plt.legend()
plt.show()
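A caveat on the daily counts: filtering to branded rows before grouping drops any day on which that brand type never appeared, so such days are missing from the line and from the dashed average rather than counted as zero. A sketch on invented rows showing one way to restore those days with `reindex`:

```python
import pandas as pd

# Invented rows: the middle day has a listing, but no Amazon-branded one.
toy = pd.DataFrame({
    "date_created_day": pd.to_datetime(
        ["2022-08-29", "2022-08-29", "2022-08-30", "2022-08-31", "2022-08-31"]
    ),
    "amazon_brand": [True, True, False, True, False],
})

# Every day on which any listing was recorded.
observed_days = toy.groupby("date_created_day").size().index

# Filter-then-count, as in the notebook: the middle day is absent from the result.
amazon_counts = toy[toy["amazon_brand"]].groupby("date_created_day").size()

# Reindex against all observed days so absent days count as zero.
amazon_counts_filled = amazon_counts.reindex(observed_days, fill_value=0)

print(amazon_counts.mean(), amazon_counts_filled.mean())
```

An equivalent shortcut is `toy.groupby("date_created_day")["amazon_brand"].sum()`, which counts True values per day and keeps zero-count days automatically. Whether the real data contains any such days is not something the notebook output shows.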
This analysis explored brand visibility patterns within Amazon search results, focusing on Amazon-branded and major-branded products. Key trends were identified by comparing daily brand frequency, average ranking position, and overall brand prevalence. The dataset revealed that Amazon's own brands consistently occupied higher-ranking positions and appeared more frequently than major competitors. These patterns suggest potential brand favoritism in search result visibility, raising important considerations around competition and platform neutrality.
Further analysis with expanded datasets or additional metadata (e.g., click-through rates, category filters) could deepen these insights.