Structured Query Language
In most cases, this is the first thing that any person involved in the data world learns apart from the usual Excel & Spreadsheets.
Over the years, I’ve personally used SQL extensively and it’s the one language I truly understand and use almost every week in and out.
It’s used in ad-hoc analysis, reports, dashboards and any type of backend or database activity that I’m doing.
MySQL. PostgreSQL. Microsoft SQL. Amazon Redshift.
Almost used most of the forms out there including the most popular ones above, they all have similar syntax but differ in certain specifications across the board.
I realised there are many repetitions of code snippets or logic that’s happening and I was wasting time in the same over and over again. This made me start a simple note file called “SQL Learnings” so that I can go back and refer to it anytime I need it.
And now, that it has enough notes and learnings, I’ve decided to make it public on this page and
Keep updating it as soon as significant additions are made.
Some of my learnings are code-based/syntax based, some logic based and some on very basic stuff for my own reminders/mistakes done in the past.
Note: This whole set of learnings was made for one sole reason, to help yours truly here in repetitive work and to have one resource to go to.
I would highly recommend everyone to create their own lists so as to help them and reduce re-work every time they start working.
Logic based learnings (High level)
- The cost of
queryis very important to check before running heavy and big ones. Got warnings in multiple cases and to avoid this in future. Test and experiment small and then check costs when running big queries
- LEVEL of DATA – Very important to check and understand in any work and business problem! Understand that first before anything else Eg: Daily level, Daily-plan level for some Subscription offers etc
- Think with Scale even when writing basic code. Even a simple SQL code can be written with scale. So think it heavy and long term
- Whenever doing some date plays or timelines plays or looking at pockets of time – use calendar table and do cross joins as and when needed.
- To test your SQL code and logic quickly and platforms for learning more use the following:
- SQL Fiddle
- Domo dataflows (assuming you use it already)
- WINDOW functions: Getting and operating across a set of table rows that are connected to the current row. Rows are not grouped and they retain their separate identities.
- This is more of a Toolbox & Notepad hack
.When using “IN” function – if too many variables to add in the list either Excel or Notepad++ Regex replace features. In
In Notepad++, you can go to “Find and Replace” and then Type ^ – to add at the start of all lines and $ to add at end of all lines, wherein you can fill it with single quotes, commas etc as needed for your codes.
Code & Functions based learnings (programming level)
- For separating Long Text from delimiters in one column only – use unnest OR regexp split to table functions directly.
Eg: A column had many skillsets separated by “;” delimited and this is the code we wrote to separate them out and get the entire list available
Query: select username, unnest(string_to_array(skills_col, ‘,’)) AS
anyarray) – converts an array to a set of rows, interesting feature and can be used as needed
Series :generate_series is available in Postgresql and can be quite useful in many areas where complete data is not there. start, stop and step or step_interval are the parameters available for this function.
- Using IN operator vs JOIN – always use latter (make tables distinct and then use it not in one base join as regular ones) IN operator is okay for very small scale and checks only, not for large scale
- Whenever adding/committing code, do thorough testing of # ROWS while JOINING etc and make sure it’s
exactsame as before and no duplicates etc To check Duplicates, check by basic code of columns, count(*) and having by > 1 – ALWAYS and then see rows (Or quick use excel if available/feasible depending on data sizes )
Eg: Snippet below
SELECT contractor, agency, COUNT(*)
GROUP BY contractor, agency
HAVING COUNT(*) > 1
- One of the quirks of Greenplum is that when you do a JOIN on numeric fields you have to cast the join fields for performance reasons. For example:
left join lock.clients d on c.id = d.id should be …
left join lock.clients d on c.id::bigint = d.id::bigint
- Arithmetic operators (+, -, *, /) only work across columns on values in a given row. If you want to perform operations across multiple rows, use Aggregate functions.
- Not equal to can have 2 ways depending on syntax and SQL versions, use aptly (<> OR != )
- Count(column1) – all rows that are NOT null for column1. (not distinct ones, just not null ones)
- Having – used for conditions of aggregate functions as where cannot be used
- If you’d like to narrow the window from the entire dataset to individual groups within the dataset, you can use PARTITION BY
- The ORDER and PARTITION define what is referred to as the “window”—the ordered subset of data over which calculations are made.
- Note: You can’t use window functions and standard aggregations in the same query. More specifically, you can’t include window functions in a GROUP BY clause.
- NULL Values are a big problem here esp in large tables, hence coalesce used massively Coalesce – returns the first non-null argument. Null returned if all are null. Potentially use when you don’t know which one will be not null.
Below are some final notes on the Order of syntax and Window functions.
ORDERS of Syntax
In order below:
FROM & JOINs determine & filter rows
WHERE more filters on the rows
GROUP BY combines those rows into groups
HAVING filters groups
ORDER BY arranges the remaining rows/groups
LIMIT filters on the remaining rows/groups
Window Functions with Examples
– RANK() – one of the easiest ones. Below gets a rank of employees in each department based on salaries
RANK() OVER (PARTITION BY department ORDER BY salary DESC)
– Running total – nice example – also done in Mode, summing it up well
“take the sum of duration_seconds over the entire result set, in order by start_time.”
SUM(duration_seconds) OVER (ORDER BY start_time) AS running_total
– ROW_NUMBER() – gives basic set of numbers starting from 1. Can also give for each partition starting each time at 1 also. Can give different row
numbers to same set of values by order, but RANK won’t do that.
– RANK() – to be used to give ranks based on some order. Will give same rank to 2+ set of values based on their rows and on the ORDER by variable used.
Gives same ranks to set of values and then the subsequent value becomes the one after all same values gives
Eg: 2 2 2 , next will be 5 (3,4 skipped)
– DENSE RANK() – gives the same ranks like that of RANK() in case of similar order by variable situations but the subsequent value is just the next value
and no skipping is done
Eg: 2 2 2, next is 3 only
– ORDER by inside partitions would order by partitions only and outside would do it for overall