SAS Day to Data: Filling in the Blanks

FILLING IN THE BLANKS

CAN VISUALIZATIONS HELP A DATA NEWBIE UP HIS WORD-GAME SKILLS?

By Evan Markfield

What happens when you turn a writer loose on data visualization tools with exactly zero experience? I decided to find out by exploring data based on my current obsession: playing one of the more challenging fill-in-the-blank games on the internet.

That game is Redactle, a daily puzzle that presents a Wikipedia article with most of the words redacted. Players guess words until they uncover the article’s title, resulting in scores for total guesses and accuracy.

At first glance, all those blank spaces are … daunting.

After guessing wildly on my first few puzzles, I eventually focused on lowering my score by using context clues in the blanks (and subsequently explaining to anyone who would listen how I got “Angola” in one guess).

That led me to wonder: Were there article types I was better at identifying than others? Did certain topics happen on certain days? What other factors might help lower my scores?

The game already provided a history of my scores. And since I spend all day writing about all the awesome stuff people do with data, the solution was obvious: Ask a colleague to teach me how to use SAS® Visual Analytics so I could dive in.

LET’S GET VISUAL

The first thing I wanted to explore was the article’s topic areas, in case that might help me aim my initial guesses each day at more common topics. So I added that column to my data, and with the training wheels still firmly on, I dove into creating the following pie chart and bar chart.

I got a bunch of pretty colors, but the distribution of topics mostly balanced out as you’d expect. It was interesting to confirm the suspicion that you'll see a lot of science, people and places, for example, but knowing that didn't produce any advantage I could discern.

So it was time to go in a new direction: Maybe seeing my performance (average number of guesses) by topic would provide more insight.

Interesting, but again, not necessarily telling me anything I could use. Drill down to the second level of that chart, and it might appear I’m great when it comes to literature and sports (subtopics of Society). But drill down one more time, and you’ll see they only account for a grand total of five puzzles. Turns out “Places” and “People” are much better indications of categories where I’m consistently better over a larger sample size.
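For anyone who wants to try the same sanity check outside a dashboard, here is a minimal Python sketch of the idea: compute the average number of guesses per subtopic, but keep the puzzle count next to it so tiny samples are easy to spot. The rows, the `summarize` helper and the `MIN_PUZZLES` cutoff are all made up for illustration; none of them come from my actual game history.

```python
from collections import defaultdict

# Hypothetical puzzle log: (topic, subtopic, guesses).
# These rows are invented for illustration, not taken from the real data.
puzzles = [
    ("Society", "Literature", 35),
    ("Society", "Sports", 41),
    ("Places", "Countries", 60),
    ("Places", "Cities", 72),
    ("Places", "Countries", 55),
    ("People", "Rulers", 64),
    ("People", "Scientists", 58),
    ("People", "Rulers", 70),
]

def summarize(rows, key_index):
    """Group guess counts by one column; return {group: (average, count)}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {name: (sum(vals) / len(vals), len(vals)) for name, vals in groups.items()}

by_subtopic = summarize(puzzles, key_index=1)

# The lowest average looks like the "best" subtopic...
best = min(by_subtopic, key=lambda name: by_subtopic[name][0])

# ...but an arbitrary minimum sample size filters out one-puzzle wonders.
MIN_PUZZLES = 2  # assumption: any cutoff you trust would do
reliable = {name: stats for name, stats in by_subtopic.items() if stats[1] >= MIN_PUZZLES}

for name, (avg, count) in sorted(by_subtopic.items()):
    print(f"{name:<12} avg={avg:6.1f}  puzzles={count}")
```

In this toy data, the subtopic with the lowest average is backed by a single puzzle, which is the same trap the drill-down exposed.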

The colleague instructing me, SAS solutions architect Steve Mellgren, provided a nugget of wisdom: “When you're doing your dashboard, 80% of your work is the data.” So that was it: I needed more data!

I added columns for day of the week, the word count of the article’s title and whether I’d used Google for help solving it. While my data was still neither “long” nor “wide” – in what I’m told is the data science parlance – I now had more options for slicing and dicing things.

A DEAD END AND A LESSON

The question was how to approach this new data in a way that would tell me something. So I began with a bar chart of articles by day of the week. Was I really at my best on Sunday and Monday, like the bar chart said? I'm not my best anything on a Monday, so one eyebrow instinctively shot upward.

Also, I really felt in my gut like I was killing it on Fridays, which the chart said was my third-best day by average. There had to be an explanation in the data.

Welcome to a common data problem: Outliers! Despite now being pretty adept at creating new dashboards and exploring the data myself, I went back to Steve and asked for advice.

Enter the box-whisker plot. I have two cats with an affinity for hanging out in cardboard boxes, so I figured this was in my wheelhouse. But apparently, Steve was talking about a visualization that shows you where these outliers are in relation to the rest of your data. Helpful, albeit less adorable. (So, the opposite of cats.)

This certainly answered my Friday question: It actually was a generally solid day for me, but my 306 guesses trying to solve “injunction” single-handedly tanked my average. (No one tell my attorney wife, please.)

Did the box and whisker tell me anything that would improve my game? Probably not, but it did teach me that diving into data differently can prevent misleading conclusions (for instance, that you defy common Monday logic).
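To see how one bad day can drag an average around, here is a rough Python sketch. The Friday scores are hypothetical except for the 306 guesses mentioned above, and the outlier check uses the common 1.5 × IQR convention for box plots, which is an assumption on my part and may not match exactly how SAS Visual Analytics draws its whiskers.

```python
from statistics import mean, median, quantiles

# Hypothetical Friday guess counts; only the 306 comes from the article.
friday_guesses = [42, 55, 61, 48, 70, 306, 52, 66]

def iqr_outliers(values):
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

friday_mean = mean(friday_guesses)      # pulled upward by the one huge score
friday_median = median(friday_guesses)  # barely notices it
outliers = iqr_outliers(friday_guesses)

# Average again with the flagged puzzles set aside, for comparison only.
typical = [v for v in friday_guesses if v not in outliers]
trimmed_mean = mean(typical)

print(f"mean={friday_mean:.1f}  median={friday_median:.1f}")
print(f"outliers={outliers}  mean without them={trimmed_mean:.1f}")
```

The median stays close to a typical Friday while the mean does not, which is why comparing the two is a quick way to check whether an average is being skewed by a single puzzle.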

IS THAT … AN INSIGHT?

That doesn’t mean I didn’t find anything useful in my data adventure. Word count turned out to be a solid predictor of performance.

This dual-axis bar chart shows how the number of guesses steadily goes down (and accuracy goes up, albeit less dramatically) from one-word answers to four-word answers.

One-word articles are the most common and could be just about anything, like rope (174 guesses), foam (194 guesses) or monosaccharide (196 guesses). More possibility equals more guesses and lower accuracy.

Three- and four-word titles have the benefit of an occasional “free” word given to you. For example, one day when I saw “[7-letter blank] the [9-letter blank]” and birth/death date blanks in the first sentence, I knew the title was a dead person. Based on letter count, I made my guess: William the Conqueror – a three-word title solved in two guesses.

The lesson: When I see multi-word titles, I look extra hard for context clues that might lower my score.

THE SEARCH FOR FEWER GUESSES

While I try to resist the urge, sometimes search-engine assistance is required. My assumption was that using that crutch would correlate with better scores.

But look at this butterfly chart and you’ll notice that on all but a couple of topics, my average number of guesses is actually lower when I don’t use search.

I hadn’t thought of it before seeing this visualization, but it makes total sense: I rely on help when I’m already struggling. Without exploring the data, I never would have considered this now-obvious insight.

QUESTIONS AND ANSWERS … AND MORE QUESTIONS

So what is my big takeaway from this deep dive into my game data? Nothing that’s likely to have me drastically lowering my average number of guesses, to be honest.

The thing that made this worthwhile was not so much finding answers but realizing that exploring these visualizations had me discovering better, more interesting questions to ask the data. It felt like a creative exercise – something a writer is more accustomed to – because it was an iterative process, not just a black-and-white look at what the data “says.”

That’s why this will be the first in a series called “Day to Data” – data stories from non-data-scientist colleagues as they look for insights in their daily lives and activities. It will be a chance for them to explore their personal passions or interests in a whole new way – all by following the data.

ABOUT THE AUTHOR

Evan Markfield

Evan Markfield is Social Innovation Editorial Director at SAS and manages curiosity.sas.com. He is still probably better at word games than data visualization, but at least he now knows what a linear regression is.
