Disclaimer: Let's get it out of the way. I am employed by Domo; however, the views and ideas presented in this blog are my own and not necessarily the views of the company.
In this post I'm going to show you how to use Domo with its built-in data science functions to perform KMeans clustering on the Hotels dataset from Data Mining Techniques for Marketing, Sales, and Customer Relationship Management, 3rd Edition, by Gordon S. Linoff and Michael J. A. Berry (Wiley, 2011). Then I'll show you how you can do the same thing using Domo's platform to host the data and R to transform and analyze the data.
Why are we doing this?
From the platform perspective, my goal is to showcase how Domo can support Business Analytics and Data Science workflows.
From the business perspective, an organization may want to cluster Hotels to facilitate creating 'hotel personas'. These personas (ex. luxury business hotel versus weekend warrior favorite) may enable the creation of marketing campaigns or 'similar hotel' recommendation systems.
Disclaimer 2: I do not own the Hotels dataset.
The Domo Platform Organizes Datasets
Step 1: Upload the Hotels dataset to the Domo datacenter using the file connector.
Duration: < 2 minutes or 8-ish clicks.
Domo enables analytics workflows by assembling all your recordsets in one place. The data can be sourced internally from a data warehouse, POS system, Excel spreadsheet, or SSAS cube, or come from external sources like Facebook, Google Analytics, or Kaggle.
Once you've uploaded data to Domo you can apply Tags to keep track of your datasets or implement security to control who has access to your data (even down to the row-level).
Additional notes for InfoSec:
Given that the largest risk in data security is user behavior, finding solutions that users will actually prefer over personal GitHub or Dropbox accounts, USB sticks, or (God forbid) the desktop remains a priority. One of Domo's most understated highlights is its ability to surface IT-governed, cleansed datasets to analysts in a single place that's highly accessible yet protected by Akamai-backed security.
Notes for architects and engineers:
Under the hood, Domo's platform combines the best of big data / data lake architecture (distributed, highly available, etc.) with (good) datamart recordset management techniques. Data workers familiar with Hadoop or Amazon Web Services will recognize Domo as a modular yet integrated big data stack wrapped in an easy-to-use GUI that business users will appreciate and get value from on day one.
When you set up a trial of Domo you effectively have a free data lake that's ready for use -- complete with the ability to store millions of rows and gigabytes of data, with the kind of performance you'd expect from Amazon S3 or Microsoft Azure storage, at a fraction of the cost. Try it.
To the horror of every data scientist out there, in this blog I'll skip data preprocessing, profiling, and exploration and go straight to using Magic ETL to apply KMeans clustering to a set of columns in my data.
ETL stands for Extract, Transform, and Load, and Domo provides a proprietary drag-and-drop interface (Magic ETL) for data transformation. Technically, the data has already been extracted from your source system and loaded into Domo, so all we're left with is the transform phase.
Double Side Note for Clarity:
Each DataSet is stored in Domo as a separate file.
Each DataFlow takes Input DataSets and creates Output DataSets (new files).
You have 3 types of dataflows native to the Domo platform.
Magic ETL has a user-friendly set of data science functions. They are in beta, so if you don't see them in Magic ETL, shoot me an email so we can get them enabled for you.
What is Clustering and why are we using it?
It is not my goal to make you a data scientist in a 5-minute blog post, but if you are interested, the Data Mining Techniques book I linked earlier may be a decent place to start.
Consider the following (completely fictitious) example:
"Jae, given your stay at the Threadneedles Hotel in London, you might like the Ace Hotel in Portland which has similar review rates by international guests, number of rooms and avg. room rates."
In short, clustering allows you to group entities (hotels) by sets of attributes (in our case, number of rooms, number of domestic reviews, international reviews, whether the hotel is independent etc.).
How easy is Magic ETL?
It's 1, 2, 3. Drag transformation functions (1) into the workspace (2), then define the function parameters (3).
In this example, I used the previously uploaded hotels-wiley dataset as an input, then I applied a K-Means function over a subset of columns.
Note: I explicitly defined the number of clusters to create (5).
Step 3: Preview and Output the Dataset for visualization or further analysis
In the final steps, we'll add a SELECT function to choose which columns to keep. In this case, we'll only keep columns that were used by the KMeans algorithm, as well as columns to identify the hotel (hotel_code).
Lastly, we add an Output Dataset function to create a new dataset in Domo which we can later visualize or use for further analysis.
Beware the false prophets ...
THIS IS IMPORTANT.
Do not fall into the trap of misinterpreting your results!
In the output dataset below, you'll see we have a new column cluster which happens to be next to a column bubble_rating. The hasty analyst might conclude: "oh hey, there appears to be a correlation between cluster and hotel ratings."
And all Data Scientists in the room #facepalm while #cheerForJobSecurity.
There are 5 clusters because in an earlier step we told the algorithm to create 5 clusters. We could just as easily have created 3 or 7. There is not necessarily a correlation between which cluster a hotel ended up in and its rating. Keep in mind, the cluster number is just an arbitrary label. If you re-run the function, ideally you'll end up with the same groupings of hotels, but they could easily have different cluster numbers (hotel_cluster_3 could become hotel_cluster_4, and hotel_cluster_4 could become hotel_cluster_1).
Side Note which will become important later:
It would be nice if Domo included metrics for measuring separation between clusters or the strength of clusters. We arbitrarily told KMeans to create 5 clusters. But who knows, maybe there are really only 3 clusters. There are quantitative methods for identifying 'the right' number of clusters, but they aren't available in Domo out of the box.
Domo embraces all Analytic Tools
Let's do it again. In R.
As demonstrated, Domo has user-friendly data science functionality suitable for the novice data scientist; however, in real-world applications, analysts will likely use Domo's data science functions to build up and validate proof of concepts before transitioning to a 'proper' data science development platform to deliver analytics.
Domo has both Python and R packages that facilitate easy data manipulation in full-fledged data analytics environments.
To extract data from Domo into R:
The R Code:
#install and load the packages required to extract data from Domo
install.packages('devtools', dependencies = TRUE)
devtools::install_github('domoinc-r/DomoR')
library(DomoR)
#initialize connection to your Domo instance
domo_instance <- 'yourDomoInstance'  # the subdomain of yourDomoInstance.domo.com
your_API_key  <- 'yourAPIKey'        # a Domo access token
DomoR::init(domo_instance, your_API_key)
#extract data from Domo into R as a data frame
datasetID <- 'yourDataSetID'
hotels_df <- DomoR::fetch(datasetID)
## PROFIT!! ##
Earlier we asked whether 5 was 'the right' number of clusters.
Given the variables from before (number of rooms, number of domestic reviews, etc.), once NULL values have been removed and the variables scaled and centered, it appears that 6 clusters may have been a better choice.
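For the curious, here is a minimal sketch of the kind of check that suggests this, using the classic 'elbow' plot of within-cluster sum of squares. The column names below (rooms, domestic_reviews, international_reviews) are illustrative stand-ins for the hotel attributes; the actual hotels-wiley columns may be named differently.
# keep the id plus the clustering variables (names are illustrative),
# drop rows with NULLs, then center and scale the numeric attributes
vars <- c('rooms', 'domestic_reviews', 'international_reviews')
hotels_clean   <- na.omit(hotels_df[, c('hotel_code', vars)])
hotel_features <- scale(hotels_clean[, vars])
# total within-cluster sum of squares for k = 1..10; look for the 'elbow'
set.seed(42)
wss <- sapply(1:10, function(k) kmeans(hotel_features, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = 'b', xlab = 'Number of clusters k', ylab = 'Total within-cluster SS')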
Side bar: READ THE MANUAL! Unless you read the detailed documentation, it's unclear how Domo's KMeans handles NULL values or whether any data pre-processing takes place. This can have a significant impact on results.
With the revelation that we should have used 6 KMeans clusters, we can either adjust our Magic ETL dataflow in Domo, or we can use R to create / replace a new dataset in Domo!
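Sticking with the R route, here's a hedged sketch of fitting the final six-cluster model and assembling the data frame we'll push back to Domo (it reuses hotels_clean and hotel_features from the sketch above):
# fit the final model with 6 clusters and attach the assignments
# back to the identifying columns from the cleaned data
set.seed(42)
km6 <- kmeans(hotel_features, centers = 6, nstart = 25)
cluster_df <- data.frame(hotels_clean, r_cluster = km6$cluster)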
DomoR::create( cluster_df, "R_hotel KMeans")
In Domo's Magic ETL, we'll bind both cluster results to the original dataset for comparison.
Wrap it up
NEXT STEPS: Create 'Cluster Personas.'
Typically, organizations would profile the attributes of each cluster to define 'hotel personas'.
From there, the outcome of this clustering and persona generation may influence discount packages or marketing campaigns to appeal to specific travelers.
REMEMBER: Clustering is not intended to predict Ratings! If you're still stuck on that, review the purpose of clustering.
In this quick article, I give the most cursory of overviews of how Domo's native (and beta) features can enable data science workflows. We also explored how other analytic workflows can integrate into the Domo offering.
If you're interested in finding out more, I can connect you with a Domo Sales representative (not me), or I'd love to talk to you about your analytic workflows and use case to see if Domo's platform might be a good match for your organization!
Disclaimer: the views, thoughts, and opinions expressed in this post belong solely to the author and do not necessarily reflect the views of the author's employer, Domo Inc.
Why this blog series exists:
When you buy an out-of-the-box data warehouse & cube solution like Jet Enterprise from Jet Reports (www.jetreports.com) or ZapBI, it usually meets the 80/20 rule: the stock solution satisfies 80% of your reporting requirements, and you'll need further customizations to meet your remaining analytics requirements.
Client requests from customers like you inspire this blog series.
Set a Default Value for Budget Amount
Problem: Some measures require filtering by a default value to show reasonable results.
Consider the example of a company that creates and revises a new budget each quarter. While it's perfectly reasonable to show the [Budget Amount] for March 2017 from either the 'Initial 2017 Budget' or the 'Q1 2017 Amended' budget, it would be misleading to show the sum of both budgets for March 2017. Similarly, if your organization uses a forecasting tool to calculate expected item consumption or sales, you may have multiple forecast numbers for the same item and month.
Solution: To minimize the risk of 'double counting', we'll modify [Budget Amount] to always filter by a default [Budget Name].
In this solution, we hard-coded a budget into the Default Member as part of the dimension definition. For improved maintainability, we could add an [IsDefaultBudget] attribute to the [Budget] dimension in our cube and data warehouse, then reference the new column when defining the Default Member. Ideally, our source ERP system can identify the current budget; otherwise, we can implement business processes and SQL code to auto-identify the correct budget.
Note: Because we can have multiple concurrent budgets, but shouldn't sum [Budget Amount] across different budgets, the measure is a semi-additive fact -- in Business Intelligence nomenclature semi-additive facts cannot be summed across all dimensions. Other common examples include balance, profit percent, average unit price, company consolidation, as well as most cost-allocation schemes. When customizing our data warehouses and cubes, we must be careful where we implement the calculation to avoid reporting errors.
Non-additive facts cannot be summed (ex. [First Invoice Date]). Fully-additive facts are 'normal' facts which can be summed across all dimensions.
Invert the sign of Income Statement Accounts
Problem: Sometimes the dimension used affects how we want to calculate a measure.
In many ERP systems the [Amount] or [Quantity] columns have counter-intuitive signs. Consider G/L entries, which carry negative signs for revenue and positive signs for expenses, or inventory transactions, which show negative signs for item sales and shipments and positive signs for item purchases. While the data makes sense to accountants and analysts, business users frequently appreciate seeing signs that intuitively make sense.
In cube-based reports, we'd like to invert the sign on [Amount] for all Income Statement accounts but not Balance Sheet accounts. Note, it's arguably 'wrong' to invert the sign in the data warehouse because that would make our [Amount] column semi-additive, and potentially cause problems for our auditors!
Solution: In the cube, use a SCOPE statement to assign a different calculation to [Amount] based on the [GL Account] dimension.
Note: in SSAS cube nomenclature we typically deal with two out of three measure types: Standard and Calculated (avoid Derived measures when possible).
A Standard measure is typically a SUM or COUNT of a column that exists on the fact table in the data warehouse: ex. sum(Amount) or sum(Quantity).
A Calculated measure is an expression that typically uses standard measures in conjunction with MDX expressions and functions. Typical examples include division (profit margin percent or avg. daily sales) as well as all business functions (Year To Date, Balance, or Previous Period).
A Derived measure is a calculation based on fields from the same fact table -- ex. Profit Margin ([Sales Amount] - [Cost Amount]). When possible, avoid derived measures and use standard measures instead -- just add the calculation as a new column in the data warehouse. Using standard instead of derived measures has two major advantages:
That's All Folks
Remember, out of the box solutions by definition are designed to work for ALL customers. Further customization is expected, even required! Onyx Reporting is available for development services, training and mentoring as you implement minor (if not major) customizations to your Jet Enterprise implementation.
For clients not using Jet Enterprise, please contact us to find out how data warehouse automation can cut your business intelligence implementation time from weeks and months into days.
If you're anything like me, you've got multiple copies of NAV from various clients stored on your SQL Server, and if you forget to write down which version of NAV each one is on, that can be a pain.
Fortunately, table structure doesn't change THAAAAAT much ;) but just in case:
SELECT [NAV Database Version] = CASE
    WHEN [databaseversionno] = 40 THEN 'Navision 4 SP2'
    -- add further WHEN clauses for the other version numbers you encounter
    ELSE 'Unknown'
END
FROM [$ndo$dbproperty]  -- the NAV system table that stores the database version number
Last week we explored 6 dashboard design tips to improve the aesthetics of existing dashboards. In this post, we share design tips from our analysts about dashboard and OLAP cube requirements gathering as well as costly mistakes to avoid.
12 Tips to Remember
Tailor your dashboard to the audience AND the process they're trying to optimize.
5 Tips to Avoid
Kicking off a project with an overly complex problem quickly leads to project stall or paralysis.
Go Forth and do Great Things
Still looking for inspiration?
What exactly are "multiple environments?" It's an infrastructure that allows your business to separate BI development endeavors from the live environment the organization uses for reporting and analytics.
"Why would I need it?" Development is not always a quick 5 or even 15-minute fix. Larger projects can take days, weeks, even months to complete, so you need a sandbox for the developers to 'play' in while the rest of the organization continues forward with 100% uptime on their live reporting environment. Some organizations may even choose to separate the Dev sandbox from QA (Quality Assurance) efforts, so a third Test environment may be needed!
But "do I need multiple environments?" As with any business question, 'it depends'. If your development is limited to just adding a field here or a new measure there during the initial implementation of your business intelligence project, you may be able to wiggle by without separated environments.
It may make sense to separate development from the live environment if:
It may make sense to implement a QA environment if your organization has
Ready to get started today?
This presentation is preparation for a training series Onyx Reporting will be conducting in 2017! Join our mailing list to keep up with special offers, training opportunities and our blog series.
At two conferences in the same month, I overheard CTOs express that enterprise resource planning (ERP) systems like Dynamics NAV, GP, or Sage could only capture what happened inside a company, and that to uncover the 'why', organizations would have to consider data collected outside their ERP (Google Analytics, CRM, webstore, Mailchimp, Twitter, etc.).
Ex. Data analysts can build sales or marketing spend reports based on data stored in NAV with aplomb. They can even ascertain product and customer trends, but without external data, it'd be difficult to determine why organizations are seeing that behavior or, more importantly, how they can influence it.
This blog post will review
The Data Discovery Hub facilitates data integration by separating ETL (extract, transform, load) into three databases: ODS, DSA, MDW
In an effort to escape 'Excel-anarchy' many organizations transition away from error-prone spreadsheet-driven reporting toward governed enterprise data warehouses which are expected to provide a 'single-version-of-the-truth' for all reporting and analytics endeavours. To that end, during the implementation process, bespoke business rules and constraints are applied to the data warehouse to enforce the consistency and validity of reported data.
Although data warehouse applications (like Jet Reports' Jet Data Manager) simplify the process of applying business rules to data sets, the majority of implementations Onyx Reporting encounters do not report and monitor outlier records that fail data validation constraints. Translation: you may be truncating data and not know it. Double translation: you may be making decisions based on incomplete data! That's not to say that supporting tools within the product ecosystem don't exist; they just frequently go underutilized.
Every Data Warehouse is built on (un)declared Business Rules
The expectation that business keys (like Customer No or G/L Account No) will uniquely identify one member of a dimension is virtually universal. Though this is a semi-obvious requirement, conflicts between the expected and actual data can arise, particularly when organizations integrate data from disparate sources (multiple companies, a legacy system, web-based or unstructured data sources). Additionally, there can be a disconnect between the expected values laid out by standard operating procedures and the actual values recorded in the ERP system (ex. every sales event should be attributed to a Salesperson or Geographic Region).
To close the loop, and prevent your organization from making strategic decisions based on incomplete information, system implementors must add measures and controls for monitoring and correcting records that fail data validation.
Join our mailing list to be informed when we publish new articles.
Author Jae Wilson, lead data strategist at Onyx Reporting, partners with co-author Joel Conarton, director at executive and management consultancy Catalystis, to provide strategic solutions for data-driven organizations.
Are you ready to partner with us?
As I create new How-To guides specific to data warehouse automation, I'll add them here.
Fuzzy Look-ups allow businesses to match and conform data to a reference set.
In this video, we use the Jet Data Manager, a data warehouse automation tool, in conjunction with SSIS and Visual Studio to rapidly implement a data quality architecture.
In the last decade, software developers addressed the barriers to comprehensive data programs by developing robust data warehouse automation tools that automated code generation for recurring tasks in data projects. Although the solution had obvious benefits to the developers hacking out code in the basement, the value proposition was unclear for business executives. "You want me to spend 100k on what?"
Assumption: To remain competitive organizations must leverage data to augment their product or services offering and/or use analytics to reduce costs or innovate value-added business processes.
Why do I need a governed data warehouse? Can't I stick with my self-service solution?
No, self-service business intelligence (BI) will not replace a data warehouse. Your organization needs:
We still need self-service tools. Data-driven innovation frequently stems from exploration by individuals or small teams before transitioning into enterprise-wide solutions. The cutting edge of data innovation lies at the intersection of self-service flexibility and agile datastore implementations serviced by an automation tool.
--Update 4/10/2016 --
Onyx Reporting uses and recommends data warehouse automation tools developed by TimeXtender because, while the application does auto-generate code, all parts of the business intelligence development process (extract, transform, and load) remain accessible and customizable using traditional tools in the Microsoft BI stack.
This two-part article describes a 6-step framework, Vision, Mission, Strategy, Goals, Initiatives, Actions (abbreviated VMSGIA) executives can leverage for refocusing business innovation around strategic goals.
Onyx Reporting uses this methodology during the needs assessment phase of larger business intelligence (BI) and data strategy projects to prioritize, frame, and deliver high-value analytics solutions. Over the next two posts, we will:
By Jae Myong Wilson - keep abreast of BI strategy from the comfort of your inbox.
A Data Project Sunk in Dry-dock
Piecemeal design produces hamstrung data strategy teams.
As organizations evaluate new analytics tools, the question "Do you have any sample dashboards?" invariably arises. Though it seems a reasonable request, in most cases it derails the data project team because the process rapidly devolves into piecemeal design and never recovers. "I don't like the layout of this report." "How much would it cost to change this feature?" "How many hours would it take to add a new calculation?"
Instead of proactively designing a comprehensive solution, the data consultancy is relegated to reactively implementing fixes.