Disclaimer: Let's get it out of the way. I am employed by Domo; however, the views and ideas presented in this blog are my own and not necessarily the views of the company.
In this post I'm going to show you how to use Domo's built-in data science functions to perform KMeans clustering using the Hotels dataset from Data Mining Techniques for Marketing, Sales, and Customer Relationship Management, 3rd Edition, by Gordon S. Linoff and Michael J. A. Berry (Wiley, 2011). Then I'll show you how to do the same thing using Domo's platform to host the data and R to transform and analyze it.
Why are we doing this?
From the platform perspective, my goal is to showcase how Domo can support Business Analytics and Data Science workflows.
From the business perspective, an organization may want to cluster hotels to facilitate creating 'hotel personas'. These personas (e.g., luxury business hotel versus weekend-warrior favorite) may enable the creation of marketing campaigns or 'similar hotel' recommendation systems.
Disclaimer 2: I do not own the Hotels dataset.
The Domo Platform Organizes Datasets
Step 1: Upload the Hotels dataset to the Domo datacenter using the file connector.
Duration: < 2 minutes or 8-ish clicks.
Domo enables analytics workflows by assembling all the recordsets in one place. The data can be sourced internally from a data warehouse, POS system, Excel spreadsheet or SSAS cube or come from external sources like Facebook, Google Analytics, or Kaggle.
Once you've uploaded data to Domo you can apply Tags to keep track of your datasets or implement security to control who has access to your data (even down to the row-level).
Additional notes for InfoSec:
Given that the largest risk in data security is user behavior, finding solutions that are preferable to personal GitHub or Dropbox accounts, USB sticks or (God forbid) the desktop remains a priority. One of Domo's most understated highlights is its ability to surface IT-governed and cleansed datasets to analysts in a single place that's highly accessible yet fortified with robust (Akamai-backed) security.
Notes for architects and engineers:
Under the hood, Domo's platform combines the best of big data / data lake architecture (distributed, highly available, etc.) with (good) datamart recordset management techniques. Data workers familiar with Hadoop or Amazon Web Services will recognize Domo as a modular yet integrated big data stack wrapped in an easy-to-use GUI that business users will appreciate and get value out of from day one.
When you set up a trial of Domo you effectively have a free data lake that's ready for use, complete with the ability to store millions of rows and gigabytes of data, with the capacity you'd expect from Amazon S3 or Microsoft Azure storage at a fraction of the cost. Try it.
To the horror of every data scientist out there, in this blog I'll skip data preprocessing, profiling, and exploration, and go straight to using Magic ETL to apply KMeans clustering to a set of columns in my data.
ETL stands for Extract, Transform and Load, and Domo provides a proprietary drag-and-drop interface (Magic ETL) for data transformation. Technically, the data has already been extracted from your source system and loaded into Domo, so all we're left with is the transform phase.
Double Side Note for Clarity:
Each DataSet is stored somewhere in Domo as a separate file.
Each DataFlow has Input DataSets and creates Output DataSets (new files).
There are 3 types of dataflows native to the Domo platform.
Magic ETL includes a user-friendly set of data science functions. They are in beta, so if you don't see them in Magic ETL, shoot me an email so we can get them enabled for you.
What is Clustering and why are we using it?
It is not my goal to make you a data scientist in a 5-minute blog post, but if you are interested, the Data Mining Techniques book I linked earlier may be a decent place to start.
Consider the following (completely fictitious) example:
"Jae, given your stay at the Threadneedles Hotel in London, you might like the Ace Hotel in Portland which has similar review rates by international guests, number of rooms and avg. room rates."
In short, clustering allows you to group entities (hotels) by sets of attributes (in our case, number of rooms, number of domestic reviews, international reviews, whether the hotel is independent etc.).
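For the curious, that grouping idea can be sketched in a few lines of code. Below is a minimal pure-Python version of the KMeans (Lloyd's) algorithm on made-up hotel attributes; the data, the two columns, and k=2 are illustrative assumptions on my part, not a reflection of how Domo implements it.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: alternately assign points to the nearest
    centroid, then move each centroid to the mean of its assignees."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for n, p in enumerate(points):
            labels[n] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its cluster.
        for c in range(k):
            members = [p for n, p in enumerate(points) if labels[n] == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels

# Hypothetical hotel rows as (number_of_rooms, avg_room_rate):
# two obvious groups, small budget properties versus large expensive ones.
hotels = [(20, 80), (25, 90), (22, 85), (300, 400), (320, 420), (310, 390)]
labels = kmeans(hotels, k=2)
```

The two alternating steps are all there is to it; production implementations mostly add smarter seeding and multiple restarts.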
How easy is Magic ETL?
It's as easy as 1, 2, 3. Drag transformation functions (1) into the workspace (2), then define the function parameters (3).
In this example, I used the previously uploaded hotels-wiley dataset as an input, then I applied a K-Means function over a subset of columns.
Note: I explicitly defined the number of clusters to create (5).
Step 3: Preview and Output the Dataset for visualization or further analysis
In the final steps, we'll add a SELECT function to choose which columns to keep. In this case, we'll only keep columns that were used by the KMeans algorithm, as well as columns to identify the hotel (hotel_code).
Lastly, we add an Output Dataset function to create a new dataset in Domo which we can later visualize or use for further analysis.
Beware the false prophets ...
THIS IS IMPORTANT.
Do not fall into the trap of misinterpreting your results!
In the output dataset below, you'll see we have a new column cluster which happens to be next to a column bubble_rating. The hasty analyst might conclude: "oh hey, there appears to be a correlation between cluster and hotel ratings."
And all Data Scientists in the room #facepalm while #cheerForJobSecurity.
There are 5 clusters because in an earlier step we told the algorithm to create 5 clusters. We could just as easily have created 3 or 7. There is not necessarily a correlation between which cluster a hotel ended up in and its rating. Keep in mind, the cluster number is just an arbitrary label. If you re-run the function, ideally you'll end up with the same groupings of hotels, but they could easily have different cluster numbers (hotel_cluster_3 could become hotel_cluster_4, and hotel_cluster_4 could become hotel_cluster_1).
Side Note which will become important later:
It would be nice if Domo included metrics for measuring the separation between clusters or the strength of clusters. We arbitrarily told KMeans to create 5 clusters, but who knows, maybe there are really only 3. There are quantitative methods for identifying 'the right' number of clusters, but they aren't available in Domo out of the box.
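One of those quantitative methods, the 'elbow' heuristic, is easy to sketch: run KMeans for several values of k and watch the within-cluster sum of squares (inertia); the point where the improvement flattens out suggests a natural number of clusters. The pure-Python sketch below uses deterministic farthest-first seeding and toy data of my own invention; it illustrates the heuristic, nothing more.

```python
def cluster_inertia(points, k, iters=25):
    """KMeans with deterministic farthest-first seeding; returns the
    within-cluster sum of squares (inertia). Lower = tighter clusters."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Seeding: start at the first point, then repeatedly add the point
    # farthest from every centroid chosen so far.
    cents = [points[0]]
    while len(cents) < k:
        cents.append(max(points, key=lambda p: min(d2(p, c) for c in cents)))

    for _ in range(iters):  # standard Lloyd iterations
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: d2(p, cents[c]))].append(p)
        cents = [tuple(sum(d) / len(g) for d in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    # Inertia: each point's squared distance to its nearest centroid.
    return sum(min(d2(p, c) for c in cents) for p in points)

# Three well-separated toy groups, so the 'right' k is 3.
data = [(0, 0), (1, 1), (0, 1),
        (100, 0), (101, 1), (100, 1),
        (0, 100), (1, 101), (0, 101)]
inertias = {k: cluster_inertia(data, k) for k in (1, 2, 3, 4)}
# Inertia keeps falling as k grows, but the drop flattens sharply after
# k=3: that 'elbow' is the quantitative hint for the number of clusters.
```

Silhouette scores and gap statistics are the more rigorous cousins of this heuristic, but the elbow is the easiest to eyeball.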
Domo embraces all Analytic Tools
Let's do it again. In R.
As demonstrated, Domo has user-friendly data science functionality suitable for the novice data scientist; however, in real-world applications, analysts will likely use Domo's data science functions to build and validate proofs of concept before transitioning to a 'proper' data science development platform to deliver analytics.
Domo has both Python and R packages that make it easy to move and manipulate data in full-fledged analytics environments.
To extract data from Domo into R:
The R Code:
#install and load the packages required to extract data from Domo
install.packages('devtools', dependencies = TRUE)
devtools::install_github('domoinc-r/DomoR')
library(DomoR)
#initialize the connection to your Domo instance
domo_instance <- 'yourDomoInstance'
your_API_key <- 'yourAPIKey'
DomoR::init(domo_instance, your_API_key)
#extract data from Domo into R
datasetID <- 'yourDataSetID'
hotels_df <- DomoR::fetch(datasetID)
## PROFIT!! ##
Earlier we asked the question: was 5 'the right' number of clusters?
Given the variables from before (number of rooms, number of domestic reviews, etc.), once NULL values have been removed and the variables scaled and centered, it appears that 6 clusters may have been a better choice.
Side bar: READ THE MANUAL! Unless you read the detailed documentation, it's unclear how Domo's KMeans handles NULL values or whether any data pre-processing takes place. This can have a significant impact on results.
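To make that side bar concrete, here's what the pre-processing typically looks like. The sketch below is pure Python on hypothetical hotel rows; the column names and values are made up, and it only illustrates the standard approach (drop incomplete rows, then z-score each column), not what Domo or any particular R package does internally.

```python
import math

def preprocess(rows):
    """Drop rows containing missing values, then z-score each column:
    subtract the column mean and divide by the standard deviation, so no
    single wide-ranged column (e.g. room rate) dominates the distances."""
    complete = [r for r in rows if None not in r]
    scaled_cols = []
    for col in zip(*complete):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        scaled_cols.append([(x - mean) / sd for x in col])
    # Re-assemble rows from the scaled columns.
    return [tuple(vals) for vals in zip(*scaled_cols)]

# Hypothetical hotel rows: (number_of_rooms, avg_room_rate).
raw = [(20, 80.0), (300, None), (320, 420.0), (25, 95.0)]
clean = preprocess(raw)  # the row with the missing rate is dropped
```

Whether a given tool drops, imputes, or silently zero-fills NULLs can change the clusters you get, which is exactly why reading the documentation matters.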
With the revelation that we should have used 6 KMeans clusters, we can either adjust our Magic ETL dataflow in Domo, or use R to create (or replace) a dataset in Domo!
DomoR::create(cluster_df, "R_hotel KMeans")
In Domo's Magic ETL, we'll bind both cluster results to the original dataset for comparison.
Wrap it up
NEXT STEPS: Create 'Cluster Personas.'
Typically, organizations would use clusters of attributes to define 'hotel personas'.
From there, the outcome of this clustering and persona generation may influence discount packages or marketing campaigns to appeal to specific travelers.
REMEMBER: Clustering is not intended to predict Ratings! If you're still stuck on that, review the purpose of clustering.
In this quick article, I gave the most cursory of overviews of how Domo's native (and beta) features can enable data science workflows. We also explored how other analytic workflows can integrate with the Domo offering.
If you're interested in finding out more, I can connect you with a Domo Sales representative (not me), or I'd love to talk to you about your analytic workflows and use case to see if Domo's platform might be a good match for your organization!
Disclaimer: the views, thoughts, and opinions expressed in this post belong solely to the author and do not necessarily reflect the views of the author's employer, Domo Inc.
Sorry it's been so quiet! I've been super busy.
Here's a quick 'report doctor' tutorial based on a report sent in by Jorge, an end user from my YouTube channel. He was watching my tutorial on NL(First) and posted in the comments that he was unable to figure out why his report still wasn't working as expected.
After a little poking around, I saw a couple of opportunities for improving his report and spotted why it wasn't working as expected.
This video covers:
If you're using the [Inventory Turnover] report from the Jet Reports Report Player, you may be drawing the wrong conclusions because THE REPORT HAS AN ERROR IN IT.
If you use the [Location Code] slicer, you may assume that you're looking at the inventory from THAT location; however, due to the design of the report, you're actually looking at inventory across ALL locations.
Easy solution? Recreate the report in Jet Enterprise.
At 40+ minutes, the tutorial is a mini-consulting session! But if you're strapped for time, I've included hyperlinks that function as chapter markers.
Use Excel's GetPivotData function: https://youtu.be/h2Si6xHmWbw?t=262
Show Calculated Fields in Pivot Tables: https://youtu.be/h2Si6xHmWbw?t=604
The Inventory Turnover Report IS WRONG: https://youtu.be/h2Si6xHmWbw?t=781
Building KPIs in Jet Data Manager: https://youtu.be/h2Si6xHmWbw?t=1107
Visualizing your cubes in Excel #PivotTables: https://youtu.be/h2Si6xHmWbw?t=1712
When Rollups make your KPIs 'wrong': https://youtu.be/h2Si6xHmWbw?t=1798
Visualizing your data in PowerBI: https://youtu.be/h2Si6xHmWbw?t=2293
If this is all Greek to you and you want someone to review your existing Jet Enterprise implementation (there's a new update to Jet Data Manager), give me a holler at email@example.com or shoot us a message through the website. Our team is available for training, development services and support, remotely as well as onsite!
The new update to Jet Data Manager is finally here, and Harry L., the lead KB article writer over at Jet Reports (www.jetreports.com), has been super prolific in documenting the updates! Side bar: the Change Log hasn't been updated yet, but I'll post about it as soon as it is.
If you're on an older build of JDM (anything less than 17.5) today is the day to update because the features and performance improvements are HUGE.
There are a grip of new KB articles, but I'll just share the ones I think are most relevant to me and my Jet Enterprise clients:
If most of these articles are Greek to you, that's fine; give me a holler at Jae@OnyxReporting.com, or open a chat window and I'd be more than happy to give a free assessment as to whether the upgrade to JDM 17.5 is appropriate for you!
One of the selling points of Jet Enterprise / TimeXtender is how easy it is to create cubes that let you quickly and flexibly analyze data that rolls up by hierarchies (think [Sales] rolled up by [Date YQMD], or by [Customer] > [State] > [Region]).
Traditional Business Intelligence developers and SQL DBAs know and recognize this as 'User-Defined Hierarchies', traditionally implemented in Visual Studio or SQL Server Data Tools. Furthermore, your developer probably knows a handful of ways to optimize SSAS performance which can (and should) be duplicated in Jet Enterprise.
This tutorial shows how to analyze your cube for user-hierarchy optimization opportunities and then how to implement them in JDM via attribute relations.
For an in-depth introduction to attribute relations in SSAS multidimensional, check out MS SQL Tips.
After my last video and blog post on 5 ways to sort Jet Reports, Mario from Jet Reports Spain asked for a tutorial on the "sort on sum" technique. So I made one!
Full disclosure: this is not an easy technique! It requires a little knowledge about using double quotes (") in Excel strings, as well as adding calculated fields inside of Jet replicators using NF functions. It may seem a little confusing, but if you give me 40 minutes of your attention, hopefully it'll all make sense.
Fair Warning: Only use calculated fields in your report if you absolutely need them (because they reduce report performance).
Like what you saw?
Have reports that require construction?
Give us a holler.
If your Jet Report has ever suddenly stopped working ("it was fine, and all of a sudden..."), many times it can be attributed to behaviors that cause spreadsheet corruption. The most common culprit? Changing the structure of the spreadsheet.
In the following 45-minute tutorial, I demonstrate 5 alternatives that will NOT corrupt your workbook.
Warning: Shameless Plug.
This presentation was inspired by a client who emailed me just before the weekend. Feel free to pass your questions along; who knows, you may get an extended tutorial out of it!
Alternatively, if you need onsite training and/or a bunch of reports developed, we have a summer sale for onsite days with additional discounts for nonprofit organizations.
Interested? Give us a holler.
This tutorial takes a Jet Professional report and shows how to implement time-based performance metrics in the Jet Enterprise standard NAV project.
In case you missed the original post, I've also included the video of the mentoring session I did for the original Jet Professional Report.
In the tutorial I review using the Company= argument for multi-company reporting, and include information about report optimization using Data Dumps / Record Keys instead of a slew of NL("First") functions.
Hope you enjoy!
This tutorial highlights 3 ways of visualizing data using the Jet Reports product suite.
This tutorial will show you:
This year I, Jae Wilson, had the pleasure of representing Onyx Reporting at CollisionConf 2017, the fastest-growing tech conference in the United States. While this blog is usually reserved for business intelligence topics, I did want to share some tweet-able things I heard at Collision.
"What is the half life of data?" How rapidly does the value of your data decay, and what systems are you putting in place to capture and analyze it?
IoT-ready platforms (Internet of Things) and real-time analytics may not be high-priority concerns for your organization, but all companies, regardless of industry or size, should be examining how they acquire, analyze and act on data.
Jet Reports, PowerBI, and TimeXtender are platforms that can accelerate data acquisition, integration, and analysis. If you need support with implementation, Onyx Reporting has a summer sale for services, training and upgrades.
"Allow your audience to see themselves in your shoes."
"Keep asking Who The F$@K cares about your service."
These two nuggets of wisdom jumped out at me because I regularly ask my clients "why do your customers stay with you?" and I'm equally regularly surprised by the answers.
"We offer high-quality and personalized customer service which our loyal customers love," my client told me. Curiously, as far as I know, they haven't implemented any customer satisfaction metrics, nor do they analyze why customers churn.
Thought for the Day