Doing SQL Data Analysis As A PM
- context: Data-Informed Product Management
- tools like Tableau/Looker are good for ongoing dashboards/reports - but when you see an anomaly, you have to go into exploratory/ad-hoc analysis!
But the adhoc/exploratory tools aren't great.
What I've used
- data store: production database or RedShift mirror or even a local copy (if the prod db is small enough)
- dBeaver (update: Sequel Ace for Mac)
- GoogleSheets (or MsExcel but blech)
- (update): Steampipe
Problems with this approach...
- some queries take a long time (e.g. multiple minutes), which kills your flow
- by default, dBeaver overwrites your result-set-view each time, so if you want to "go back" you have to re-run the previous (slow) query
- you're typically writing multiple queries in a single window, and executing just the selected query. But the result set isn't clearly associated with the specific query that generated it.
- you might even start editing that query, so the query that generated the result set on the page doesn't exist anymore
- at some point you probably copy a result set over to your spreadsheet, and do additional analysis (simple calculations, pivot tables, deleting garbage rows (the latest month-aggregate because it's a partial month), etc. Then do some graphing from part of a spreadsheet area
- your spreadsheet doesn't link back to the query that made the result-set you pasted over
My hacked-up process
- I make a (Jira) ticket for an analysis
- I add a comment for each chunk of work
- I paste my query into the comment
- some time in 2021 a Jira update resulted in those pastes being double-spaced which is irritating
- I paste a link to the gsheet tab into the comment
- Sometimes I even find a good place to link from the gsheet tab back to the ticket
- I summarize my conclusions back up in the ticket body itself
- I share those conclusions by copy/pasting into Slack or Email, linking back to the ticket.
- (I've also had a couple cases where I downloaded a result-set as CSV and then did some Python coding to summarize it... but that's probably an outlier...)
- because of the manual steps, it's hard for anyone to re-run later
- also because you often have a constant like a date-range in queries
What I want
- something like Colab Notebook or nbdev? Glamorous Toolkit?
- cloud-based so I can let anyone read or comment
- process/structure that replaces Jira comments, ties narrative "here's what I'm going to check next" to query to result-set (auto-saving each one, though you can explicitly delete it)
- way to flag a step as being correct/updated next, so people can easily skip that transient step
- easy separation of parameters (date ranges) from queries using them
- pivot-table feature
- way to make the "wrap-around-bits" in a spreadsheet work as your replace the underlying result-set (which might result in a different number of rows) - does something like the old spreadsheet-script-records do that job?