
Visualizing the Difficulty of Wordle through Twitter Data

  • toldham2
  • Feb 7, 2022
  • 1 min read

Updated: May 14, 2023

Collecting and Cleaning Twitter Data to Showcase the Difficulty of this Week's Wordle

Data

Scraped from a sample of 13,000 publicly shared tweets, roughly 2,000 tweets per day from English-speaking users. Note that this is not representative of all scores, only those actively posted to Twitter.

Analysis

I hypothesized that the scores would be right-skewed, because people would be more likely to post good scores and too embarrassed to share bad ones. The fact that the distribution was roughly normal suggests that users are less reserved about their scores than I had anticipated.

The difficulty of PERKY is unsurprising given how rarely the word appears in everyday English, and the same goes for the relative ease of THOSE, a highly common word. MOIST, on the other hand, was surprisingly easy despite its niche and limited usage.

For those of you living under rocks, Wordle is the universal word puzzle that's sweeping the internet. The game, recently purchased by the New York Times, gives players six chances to guess a five-letter word, offering hints only in the form of Battleship-esque colored blocks. The kicker? Everyone in the world gets the same word each day, so you can compare your score directly with friends and family.


My favorite part about it, however, is the game's sharing features. When a user goes to share their score, the game generates a pre-formatted message containing the game id, your score, and an emoji matrix of blocks to represent your guesses without revealing the answer.
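
To make the format concrete, here is a rough sketch of what that shared text looks like (the emoji rows below are invented for illustration, not from a real game). The useful property for scraping is that the game ID and score always sit at the same character positions on the first line.

# Illustrative example of the text Wordle generates when sharing a score
# (the emoji rows are made up for this sketch)
example_tweet <- "Wordle 227 3/6

⬛🟨⬛⬛⬛
⬛🟩🟨⬛⬛
🟩🟩🟩🟩🟩"

# Because "Wordle NNN S/6" is always the first line, both fields can be
# pulled out by character position:
substr(example_tweet, 8, 10)  # "227" - the game ID
substr(example_tweet, 12, 14) # "3/6" - the score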


As soon as I saw this, I knew the consistent formatting would make for a great dataset, and I was right!


I used the Twitter API along with the rtweet and tidyverse R packages to collect and clean the data, and visualized it using Adobe Illustrator's built-in graph tool.

Code

#### Library ####

library(rtweet)
library(tidyverse)


#### Get Data ####

master <- data.frame()

Sys.sleep(900) # Wait 15 minutes to ensure no API timeouts


for (i in 223:231) { # Loop to collect and combine tweets for the past ~9 games

  query <- paste0('"Wordle ', i, '"')
  message(paste0("\n", Sys.time(), ": Getting scores for game ", i, "..."))

  tweets <- search_tweets(query,
                          n = 1999,
                          include_rts = FALSE,
                          type = "recent")

  # The share text begins "Wordle NNN S/6", so each field sits at a fixed position
  score <- str_sub(tweets$text, 12, 14)        # e.g. "3/6"
  actual_score <- str_sub(tweets$text, 12, 12) # number of guesses, or "X" for a fail
  game_id <- str_sub(tweets$text, 8, 11)
  date <- str_sub(tweets$created_at, 1, 11)

  scores <- data.frame(tweets$screen_name, score, actual_score, date, game_id, tweets$lang)

  # Drop tweets that don't follow the standard share format or aren't in English
  scores <- filter(scores, grepl("/", score))
  scores <- filter(scores, !grepl(" ", score))
  scores <- filter(scores, grepl("en", tweets.lang))

  master <- rbind(master, scores)

}


#### Clean Data ####

# Keep only well-formed scores and treat a failed puzzle (X/6) as 7 guesses
master <- filter(master, grepl("1/6|2/6|3/6|4/6|5/6|6/6|X/6", score))
master$actual_score <- gsub("X", "7", master$actual_score)

# Keep only the games in the collection window
master <- filter(master, grepl("223|224|225|226|227|228|229|230|231", game_id))

write.csv(master, "~/wordle_scores_223-231.csv")
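
The CSV above is what went into Illustrator. As a quick sanity check before exporting, the distribution can also be summarized directly in R; this is just a sketch and wasn't part of the pipeline above (ggplot2 is loaded with the tidyverse, and actual_score runs from "1" to "7", with 7 marking a failed puzzle):

#### Optional: quick distribution check (sketch, not part of the pipeline above) ####

# Count how many tweets reported each score for each game
count(master, game_id, actual_score)

# Bar chart of guesses per game; 7 means the puzzle wasn't solved (X/6)
ggplot(master, aes(x = actual_score)) +
  geom_bar() +
  facet_wrap(~game_id) +
  labs(x = "Guesses (7 = failed)", y = "Tweets")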

