
The most popular random seeds

Stefan Seemayer

Sep 25, 2015 • 4 min read

When developing programs that rely on some sort of randomness, it is often a good idea to make the random number generator behave more deterministically. For example, you could have a procedurally generated game generate its content based on a random seed that could be exchanged by players to play through the same experience. For randomized algorithms, a constant seed can help with debugging.
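To make the idea concrete, here is a minimal Python sketch of seeded determinism. The `roll_dice` helper is a made-up name for illustration; the only library call it relies on is the standard `random.Random` class.

```python
import random


def roll_dice(seed, n=5):
    """Return n die rolls from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator; the global state is untouched
    return [rng.randint(1, 6) for _ in range(n)]


first = roll_dice(42)
second = roll_dice(42)
print(first == second)  # same seed, same sequence of rolls
```

Two players who exchange the seed would see the same rolls, and a failing run of a randomized algorithm can be replayed exactly.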

Since the value of the random seed doesn't really matter, the programmer seeding the random number generator is free to pick any number that they like.

(image credit: Cyanide and Happiness)

With that said, I was wondering if there are random seed numbers that are preferred by coders (0? 1? 42? 1337? 31337?). Thanks to the huge amount of open source code and code search engines that index it, we can find out easily!

Seed search terms

Search engines require a search term to look up in their database. To include some of the most popular programming languages, I've come up with this set of canonical ways of seeding the respective random number generators:

random.seed(111)    # Python; also covers np.random.seed, numpy.random.seed
Random.new(111)     # Ruby
set.seed(111)       # R
new Random(111)     # Java, C#
srand(111)          # C/C++, Objective-C, PHP, Perl
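As a rough illustration of how these calls can be recognised in source code, the following Python sketch builds one regular expression per search term and pulls out integer seeds. The names `SEARCH_TERMS`, `PATTERNS` and `extract_seeds` are hypothetical, and escaping the dot and tolerating spaces inside the parentheses are assumptions of this sketch, not something the search engine does.

```python
import re

# The five search terms from the list above.
SEARCH_TERMS = ["random.seed", "Random.new", "set.seed", "new Random", "srand"]

# One case-insensitive pattern per term: the literal call name, optional
# whitespace, then an integer literal in parentheses.
PATTERNS = {
    term: re.compile(re.escape(term) + r"\s*\(\s*([0-9]+)\s*\)", re.IGNORECASE)
    for term in SEARCH_TERMS
}


def extract_seeds(line):
    """Return every (term, seed) pair found in one line of source code."""
    found = []
    for term, pattern in PATTERNS.items():
        for match in pattern.finditer(line):
            found.append((term, int(match.group(1))))
    return found


print(extract_seeds("np.random.seed(42)"))
print(extract_seeds("srand(time(NULL));"))  # no integer literal, so nothing is found
```

Only literal integer seeds are picked up; seeds taken from the clock or from a variable are ignored, which is what we want when looking for favourite numbers.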

Getting the Data

Now that we have some search terms to look for, we need a code search engine that will allow us to query for code, ideally providing a REST API that we can use. This turned out to be a little trickier than expected, though. The GitHub API will only let us search for code when specifying an author or repository. Open Hub (formerly Ohloh) doesn't have API access (or I couldn't find it). I finally found that the searchcode API will let us query for code, but there is a limit of 1000 results returned per query (10 pages of 100 results each).

After consulting the searchcode API, I was able to come up with the following shell script. It downloads the search results for all our search terms and saves them as .json, extracts the matched source code lines into a .txt file using jq, and generates tab-separated data containing the file extension (to determine the programming language) and the random seed used:

#!/bin/bash

QUERIES=('random.seed' 'Random.new' 'set.seed' 'new Random' 'srand')

for i in "${!QUERIES[@]}"; do
	query="${QUERIES[$i]}"

	page=0
	while [ "$page" != "null" ]; do
		url="https://searchcode.com/api/codesearch_I/?q=${query/ /+}&per_page=100&p=${page}"
		jsonfile="qry-${i}-page-${page}.json"
		matchfile="qry-${i}-page-${page}.txt"
		filterfile="qry-${i}-page-${page}.csv"

		# download from API
		if [ ! -f "${jsonfile}" ]; then
			echo "Getting '${url}'"
			curl -o "${jsonfile}" "${url}"
		fi

		# extract filename and lines fields from JSON
		jq -r ".results | map(.filename + \"\t\" + (.lines | to_entries[].value))[]" "${jsonfile}" > "${matchfile}"

		# get filename suffix and match
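		(
			# split on the tab only, so filenames containing spaces stay intact
			while IFS=$'\t' read -r fn matchline; do
				regex="${query}\\s*\\(([0-9]+)\\)"
				suffix="${fn##*.}"
				# printf avoids the globbing and word splitting of an unquoted echo
				match=$(printf '%s\n' "${matchline}" | pcregrep -o1 -i "${regex}")
				if [ -n "${match}" ] && [ -n "${suffix}" ]; then
					printf '%s\t%s\n' "${suffix}" "${match}"
				fi
			done
		) < "${matchfile}" > "${filterfile}"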

		# determine next page
		page=$(jq '.nextpage' "${jsonfile}")
		echo "next page for '${query}' is ${page}"
	done
done
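The two less obvious steps of the script are the jq expression that flattens each result into filename/line pairs and the loop that follows `nextpage` until it is null. A small Python sketch of the same logic is shown below. It runs against a hand-written fake response that only mimics the fields the script reads (`results`, `filename`, `lines`, `nextpage`); the function names and the sample data are assumptions, and no real API call is made.

```python
def flatten_results(response):
    """Yield (filename, line_text) for every matched line in one response page."""
    for result in response.get("results", []):
        for line_text in result.get("lines", {}).values():
            yield result["filename"], line_text


def collect_pages(fetch_page, max_pages=10):
    """Follow `nextpage` until it is None (or max_pages is reached)."""
    rows = []
    page = 0
    for _ in range(max_pages):
        response = fetch_page(page)
        rows.extend(flatten_results(response))
        page = response.get("nextpage")
        if page is None:
            break
    return rows


# Hypothetical two-page response, shaped like the fields the script reads.
FAKE_PAGES = {
    0: {"results": [{"filename": "game.py",
                     "lines": {"12": "random.seed(42)"}}],
        "nextpage": 1},
    1: {"results": [{"filename": "sim.c",
                     "lines": {"3": "srand(0);", "9": "srand(1);"}}],
        "nextpage": None},
}

rows = collect_pages(FAKE_PAGES.__getitem__)
print(rows)
```

Passing the page fetcher in as a function keeps the paging logic separate from the download, so it can be tried out without touching the network.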

The 50 API requests complete fairly quickly. We receive a set of tab-separated files with the following format:

js	0
py	3
py	23456

We now combine all files into a master seed file and add a header to the table:

(echo -e "suffix\tseed"; cat qry-*.csv) > seeds.csv
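Before plotting, a quick tally can serve as a sanity check of the combined table. The Python sketch below parses tab-separated text in the same suffix/seed layout and counts each seed. The `tally_seeds` name and the sample rows are made up for illustration, and seeds are kept as strings rather than converted to numbers.

```python
import csv
import io
from collections import Counter


def tally_seeds(tsv_text):
    """Count how often each seed appears in tab-separated suffix/seed data."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row["seed"] for row in reader)


# Made-up sample in the same layout as seeds.csv.
SAMPLE = "suffix\tseed\njs\t0\npy\t3\npy\t0\ncpp\t42\n"

counts = tally_seeds(SAMPLE)
print(counts.most_common(1))  # the most frequent seed in the sample
```

For the real data, the text would come from reading seeds.csv instead of the inline sample.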

Plotting

Now that we have all the data in a nice, clean format, it's time to visualize it! Hadley Wickham's ggplot2 package for R has a beautiful API and generates nice plots without much work:

#!/usr/bin/env Rscript
library(ggplot2)
library(plyr)

# load processed data
df = read.table("seeds.csv", header=TRUE)
df$seed = factor(df$seed)

# filter results to only include seeds that have been observed at least 10 times
df = ddply(df, .(seed), function(d) { if (nrow(d) >= 10) d })

# count occurrence of seeds
df = ddply(df, .(suffix, seed), function(df) {
	data.frame(suffix=df$suffix[1], seed=df$seed[1], count=nrow(df))
})

# plot
ggplot(df) + 
	geom_bar(aes(x=seed, y=count, fill=suffix), stat="identity") + 
	theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
	xlab("Random Seed") + 
	ylab("Number of Observations")

Note that we're removing all seeds that have been observed fewer than 10 times to clean up the plot a little bit.
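For readers who don't use R, the filter-then-count logic of the two ddply calls can be sketched in Python as follows. The function name, the `min_count` parameter and the sample observations are assumptions made for illustration.

```python
from collections import Counter


def count_frequent_seeds(rows, min_count=10):
    """Keep seeds seen at least min_count times overall,
    then count the survivors per (suffix, seed) pair."""
    totals = Counter(seed for _, seed in rows)
    frequent = {seed for seed, n in totals.items() if n >= min_count}
    return Counter((suffix, seed) for suffix, seed in rows if seed in frequent)


# Made-up observations: seed 0 appears 12 times across two languages,
# seed 7 only twice, so seed 7 is dropped.
rows = [("py", 0)] * 8 + [("c", 0)] * 4 + [("py", 7)] * 2
per_language = count_frequent_seeds(rows)
print(per_language)
```

The threshold applies to a seed's total across all languages, so a seed that is rare within one language still shows up there as long as it is common overall.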

Results

As it turns out, programmers prefer to keep it simple by seeding with 0 in the vast majority of cases, followed by 1, then 1234. Interestingly, there seem to be some language-specific preferences for random seeds. Apparently, Ruby and Java programmers are most happy to express their geekiness by using a seed of 42, while 5 seems to be popular only with C++ and C# programmers. I'm not sure what the special significance of 380843 as a random seed (used exclusively in C++ programs) is.

Discussion

I'm not happy with the small amount of data for this analysis, so if anybody knows of a better way to get more data on random seeds, I would be happy to integrate it!