# Introduction
When you solve enough interview-style data problems, you start noticing a funny effect: the dataset “shape” quietly dictates your coding style. A time-series table nudges you toward window functions. A star schema pushes you into JOIN chains and GROUP BY. A pandas task with two DataFrames almost begs for .merge() and isin().
This article makes that intuition measurable. Using a set of representative SQL and pandas problems, we will identify basic code-structure traits (common table expression (CTE) usage, window-function frequency, common pandas methods) and show which constructs dominate and why.
# Why Data Structure Changes Your Coding Style
Data problems are less about pure logic and more like constraints wrapped in tables:
// Rows That Depend On Other Rows (Time, Rank, “Previous Value”)
If each row’s answer depends on adjacent rows (e.g. yesterday’s temperature, previous transaction, running totals), solutions naturally lean on window functions like LAG(), LEAD(), ROW_NUMBER(), and DENSE_RANK().
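In pandas, the same row-to-row dependency shows up as `.shift()`. A minimal sketch, with invented column names, of the "yesterday's temperature" pattern:

```python
import pandas as pd

# Illustrative data: daily temperatures (names invented for this sketch)
df = pd.DataFrame({
    "day": pd.date_range("2024-01-01", periods=4),
    "temp": [10, 12, 11, 15],
})

# .shift(1) plays the role of SQL's LAG(): pull the previous row's value
df["prev_temp"] = df["temp"].shift(1)
df["delta"] = df["temp"] - df["prev_temp"]
```

The first row has no predecessor, so `prev_temp` starts as `NaN`, exactly as `LAG()` returns NULL for the first row of a partition.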
Consider, for example, this interview question’s tables:
Each customer’s result on a given day cannot be determined in an isolated way. After aggregating order costs at the customer-day level, each row must be evaluated relative to other customers on the same date to determine which total is highest.
Because the answer for one row depends on how it ranks relative to its peers within a time partition, this dataset shape naturally leads to window functions such as RANK() or DENSE_RANK() rather than simple aggregation alone.
// Multiple Tables With Roles (Dimensions vs Facts)
When one table describes entities, and another describes events, solutions tend toward JOIN + GROUP BY patterns (SQL) or .merge() + .groupby() patterns (pandas).
For instance, in this interview question, the data tables are the following:
In this example, entity attributes (users and account status) and event data (downloads) are stored separately, so the logic must first recombine them with JOINs before meaningful aggregation can take place. This fact pattern is what creates JOIN + GROUP BY solutions.
// Small Outputs With Exclusion Logic (Anti-Join Patterns)
Problems asking “who never did X” often become LEFT JOIN … IS NULL / NOT EXISTS (SQL) or ~df['col'].isin(...) (pandas).
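A minimal pandas sketch of the anti-join pattern, with invented table names, showing both the `.isin()` exclusion and its `merge`-based twin:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3, 4]})
downloads = pd.DataFrame({"user_id": [1, 3]})

# "Who never downloaded?" via boolean exclusion
never = users[~users["user_id"].isin(downloads["user_id"])]

# The same result via merge's indicator flag (the LEFT JOIN ... IS NULL shape)
merged = users.merge(downloads, on="user_id", how="left", indicator=True)
never_join = merged.loc[merged["_merge"] == "left_only", ["user_id"]]
```

Both variants return users 2 and 4; the `indicator=True` form is closer to how the SQL anti-join reads.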
# What We Measure: Code Structure Characteristics
To compare “coding style” across different solutions, it’s useful to identify a limited set of observable features that can be extracted from SQL text and Python code.
While these may not be flawless indicators of solution quality (e.g. correctness or efficiency), they can serve as trustworthy signals regarding how analysts engage with a dataset.
// SQL Features We Measure
// Pandas Features We Measure
# Which Constructs Are Most Common
To move beyond anecdotal observations and quantify these patterns, we need a simple, consistent way to derive structural signals directly from solution code.
As a concrete anchor for this workflow, we used all educational questions on the StrataScratch platform.
In the results shown below, “total occurrences” is the raw count of times a pattern appears across all code; a single question’s solution could use JOIN three times, and all three count. “Questions using” counts how many distinct questions contain at least one occurrence of that feature (a binary used / not used per question).
This method reduces each solution to a limited set of observable features, enabling us to consistently and reproducibly compare coding styles across problems and to associate dataset structure with dominant constructs directly.
// SQL Features
// Pandas Features (Python Solutions)
// Feature Extraction Code
Below are the code snippets we used. You can run them on your own solutions (or adapt the patterns to your own conventions) to extract features from the code text.
// SQL Feature Extraction (Example)
```python
import re
from collections import Counter

# Regexes for the SQL constructs we count; matched case-insensitively
SQL_FEATURES = {
    "cte": r"\bWITH\b",
    "join": r"\bJOIN\b",
    "group_by": r"\bGROUP\s+BY\b",
    "window_over": r"\bOVER\s*\(",
    "dense_rank": r"\bDENSE_RANK\b",
    "row_number": r"\bROW_NUMBER\b",
    "lag": r"\bLAG\b",
    "lead": r"\bLEAD\b",
    "not_exists": r"\bNOT\s+EXISTS\b",
}

def extract_sql_features(sql: str) -> Counter:
    """Count how many times each feature appears in a SQL string."""
    sql_u = sql.upper()
    return Counter({k: len(re.findall(p, sql_u)) for k, p in SQL_FEATURES.items()})

sql = "..."  # paste your SQL solution here
print(extract_sql_features(sql))
```
// Pandas Feature Extraction (Example)
```python
import re
from collections import Counter

# Regexes for the pandas methods we count
PD_FEATURES = {
    "merge": r"\.merge\s*\(",
    "groupby": r"\.groupby\s*\(",
    "rank": r"\.rank\s*\(",
    "isin": r"\.isin\s*\(",
    "sort_values": r"\.sort_values\s*\(",
    "drop_duplicates": r"\.drop_duplicates\s*\(",
    "transform": r"\.transform\s*\(",
}

def extract_pd_features(code: str) -> Counter:
    """Count how many times each pandas method appears in a code string."""
    return Counter({k: len(re.findall(p, code)) for k, p in PD_FEATURES.items()})

code = "..."  # paste your pandas solution here
print(extract_pd_features(code))
```
Let’s now talk in more detail about patterns we noticed.
# SQL Frequency Highlights
// Window Functions Surge In “Highest Per Day” And Tie-Friendly Ranking Tasks
For example, in this interview question, we are asked to compute a daily total per customer, then select the highest result for each date, including ties. This is a requirement that naturally leads to window functions such as RANK() or DENSE_RANK(), segmented by day.
The solution is as follows:
```sql
WITH customer_daily_totals AS (
    SELECT
        o.cust_id,
        o.order_date,
        SUM(o.total_order_cost) AS total_daily_cost
    FROM orders o
    WHERE o.order_date BETWEEN '2019-02-01' AND '2019-05-01'
    GROUP BY o.cust_id, o.order_date
),
ranked_daily_totals AS (
    SELECT
        cust_id,
        order_date,
        total_daily_cost,
        RANK() OVER (
            PARTITION BY order_date
            ORDER BY total_daily_cost DESC
        ) AS rnk
    FROM customer_daily_totals
)
SELECT
    c.first_name,
    rdt.order_date,
    rdt.total_daily_cost AS max_cost
FROM ranked_daily_totals rdt
JOIN customers c ON rdt.cust_id = c.id
WHERE rdt.rnk = 1
ORDER BY rdt.order_date;
```
This two-step approach — aggregate first, then rank within each date — shows why window functions are ideal for “highest per group” scenarios where ties need to be maintained, and why basic GROUP BY logic is inadequate.
// CTE Usage Increases When The Question Has Staged Computation
A common table expression (CTE) (or multiple CTEs) keeps each step readable and makes it easier to validate intermediate results.
This structure also reflects how analysts think: separating data preparation from business logic, allowing the query to be simpler to understand, troubleshoot, and adapt as needs change.
// JOIN Plus Aggregation Becomes The Default In Multi-table Business Metrics
When measures live in one table and dimensions in another, you often cannot avoid JOIN clauses. Once joined, GROUP BY and conditional totals (SUM(CASE WHEN ... THEN ... END)) are usually the shortest path.
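The same shape in pandas is `.merge()` followed by a grouped or conditional sum. A sketch with invented tables, where the `.where()` line mirrors `SUM(CASE WHEN ... THEN ... END)`:

```python
import pandas as pd

# Dimension table (entities) and fact table (events); names are illustrative
users = pd.DataFrame({"user_id": [1, 2, 3], "paying": [True, False, True]})
downloads = pd.DataFrame({"user_id": [1, 1, 2, 3], "n_downloads": [4, 2, 5, 1]})

# JOIN first, then aggregate
joined = downloads.merge(users, on="user_id")
totals = joined.groupby("paying")["n_downloads"].sum()

# Conditional total: zero out non-matching rows, then sum (SUM(CASE WHEN ...))
paying_total = joined["n_downloads"].where(joined["paying"], 0).sum()
```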
# Pandas Method Highlights
// .merge() Appears Whenever The Answer Depends On More Than One Table
This interview question is a good example of the pandas pattern. When rides and payment or discount logic span columns and tables, you typically first combine the data, then count or compare.
```python
import pandas as pd

orders_payments = lyft_orders.merge(lyft_payments, on='order_id')
orders_payments = orders_payments[
    (orders_payments['order_date'].dt.to_period('M') == '2021-08')
    & (orders_payments['promo_code'] == False)
]
grouped_df = orders_payments.groupby('city').size().rename('n_orders').reset_index()
result = grouped_df[grouped_df['n_orders'] == grouped_df['n_orders'].max()]['city']
```
Once the tables are merged, the remainder of the solution reduces to a familiar .groupby() and comparison step, underscoring how initial table merging can simplify downstream logic in pandas.
# Why These Patterns Keep Appearing
// Time-based Tables Often Call For Window Logic
When a problem refers to totals “per day,” comparisons between days, or selecting the highest value for each date, ordered logic is normally required. For this reason, ranking functions with OVER are common, especially when ties must be preserved.
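In pandas, the equivalent of `RANK() OVER (PARTITION BY ... ORDER BY ... DESC)` is a grouped `.rank()`. A small sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": ["2019-03-01", "2019-03-01", "2019-03-01",
                   "2019-03-02", "2019-03-02"],
    "total": [100, 100, 80, 50, 70],
})

# method="min" reproduces SQL RANK() semantics, so ties share the top spot
df["rnk"] = df.groupby("order_date")["total"].rank(method="min", ascending=False)
top = df[df["rnk"] == 1]  # both 100s on 2019-03-01 survive the tie
```

Swapping `method="min"` for `method="dense"` gives `DENSE_RANK()` behavior instead.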
// Multi-step Business Rules Benefit From Staging
Some problems mix filtering rules, joins, and computed metrics. It is possible to write everything in a single query, but this increases the difficulty of reading and debugging. CTEs help with this by separating enrichment from aggregation in a way that is easier to validate, a pattern typical of questions like Premium vs Freemium.
// Multi-table Questions Naturally Increase Join Density
If a metric depends on attributes stored in a different table, joining is required. Once tables are combined, grouped summaries are the natural next step. That overall shape shows up repeatedly in StrataScratch questions that mix event data with entity profiles.
# Practical Takeaways For Faster, Cleaner Solutions
- If the output depends on ordered rows, expect window functions like ROW_NUMBER() or DENSE_RANK().
- If the question reads like “compute A, then compute B from A,” a WITH block usually improves clarity.
- If the dataset is split across multiple entities, plan for JOIN early and decide your grouping keys before writing the final select.
- In pandas, treat .merge() as the default when the logic spans multiple DataFrames, then build the metric with .groupby() and clean filtering.
# Conclusion
Coding style follows structure: time-based and “highest per group” questions tend to produce window functions. Multi-step business rules tend to produce CTEs.
Multi-table metrics increase JOIN density, and pandas mirrors these same moves through .merge() and .groupby().
More importantly, recognizing these structural patterns early on can significantly alter your approach to a new problem. Instead of starting from syntax or memorized tricks, you can reason from the dataset itself: Is this a per-group maximum? A staged business rule? A multi-table metric?
This change in mindset allows you to anticipate the main framework before writing any code: once you learn to recognize the dataset shape, you can predict the dominant construct early. Eventually, this results in quicker solution drafting, simpler validation, and more consistency across SQL and pandas, because you are responding to the data structure, not just the question text.
Nate Rosidi is a data scientist working in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
