8 Random Variables, Sets and Sample Spaces

Because we first learn about variables in an algebra class, we tend to think of variables as having values that can be solved for—if we have enough information about them. If I say that $x$ is a variable and that $x+6=8$ , we can use algebra to find that $x$ must equal 2.

Random variables are not like algebraic variables. Random variables simply take on values because of some random process. If we say that the outcome of a throw of a six-sided die is a random variable, there is nothing to “solve for.” There is no equation that determines the value of the die. Instead, it is determined by chance and the physical constraints of the die. That is, the outcome must be one of the numbers printed on the die, and the six numbers are equally likely to occur. This illustrates an important point. The word random here does not mean “anything can happen.” On a six-sided die, you will never roll a 7, 3.5, $\sqrt{\pi}$ , −36,000, or any other number that does not appear on the six sides of the die. Random variables have outcomes that are subject to random processes, but those random processes do have constraints on them such that some outcomes are more likely than others—and some outcomes never occur at all.

Random variables have values that are determined by a random process.

Figure 8.1: Rolling a six-sided die is a process that creates a randomly ordered series of integers from 1 to 6.

When we say that the throw of a six-sided die is a random variable, we are not talking about any particular throw of a particular die but, in a sense, every throw (that has ever happened or ever could happen) of every die (that has ever existed or could exist). Imagine an immense, roaring, neverending, cascading flow of dice falling from the sky. As each die lands and disappears, a giant scoreboard nearby records the relative frequencies of ones, twos, threes, fours, fives, and sixes. That’s a random variable.

R Code for Figure 8.1

# Number of throws
k <- 50

# Dot positions
d <- map2_df(sample(1:6,k, replace = T), 1:k, makedice)  

# Plot
p <- ggplot(d) +
  geom_point(pch = 16, size = 50, aes(x, y))  +
  ob_ellipse(
    a = 1.85,
    b = 1.85,
    m1 = 12,
    linewidth = 4
  ) +
  coord_equal() +
  theme_void() +
  theme(legend.position = "none") +
  transition_manual(id) 

# Render animation
p_gif <- animate(
  p,
  fps = 1,
  device = "svg",
  renderer = magick_renderer(),
  width = 8,
  height = 8
)
p_gif

8.1 Sets

A set refers to a collection of objects. Each distinct object in a set is an element.

A set is a collection of distinct objects.An element is a distinct member of a set.

8.1.1 Discrete Sets

To show that a list of discrete elements is a discrete set, we can use curly braces. For example, the set of positive single-digit even numbers is $\{2, 4, 6, 8\}$ . With large sets with repeating patterns, it is convenient to use an ellipsis (“…”), the punctuation mark signifying an omission or pause. For example, rather than listing every two-digit positive even number, we can show the pattern like so:

A discrete set has numbers that are isolated, meaning that each number has a range in which it is the only number in the overall set.

$\{10, 12, 14,\ldots, 98\}$

If we want the pattern to repeat forever, we can set an ellipsis on the left, right, or both sides. The set of odd integers extends to infinity in both directions:

$\{\ldots, -5, -3, -1, 1, 3, 5, \ldots\}$

8.1.2 Interval Sets

With continuous variables, we can define sets in terms of intervals. Whereas the discrete set $\{0,1\}$ refers just to the numbers 0 and 1, the interval set $(0,1)$ refers to all the numbers between 0 and 1.

Intervals are a continous range of numbers.

As shown in Figure 8.2, some intervals include their endpoints and others do not. Intervals noted with square brackets include their endpoints and intervals written with parentheses exclude them. Some intervals extend to positive or negative infinity: $(-\infty,5]$ and $(-8,+\infty)$ . Use a parenthesis with infinity instead of a square bracket because infinity is not a specific number that can be included in an interval.

R Code for Figure 8.2

# Interval notation
tibble(lb = 1L,
       ub = 5L,
       y = 1:4,
       meaning = c("includes 1 and 5",
                   "excludes 1 and 5",
                   "includes 1 but not 5",
                   "includes 5 but not 1"),
       l_bracket = c("[", "(", "[", "("),
       u_bracket = c("]", ")", ")", "]")) %>% 
  mutate(Interval = paste0(l_bracket, lb, ",", ub, u_bracket) %>% 
           fct_inorder() %>% 
           fct_rev,
         l_fill = ifelse(l_bracket == "[", myfills[1], "white"),
         u_fill = ifelse(u_bracket == "]", myfills[1], "white")) %>% 
  ggplot(aes(lb, Interval)) + 
  geom_segment(aes(xend = ub, yend = Interval), 
               linewidth = 2, 
               color = myfills[1]) +
  geom_point(aes(fill = l_fill), 
             size = 5, 
             pch = 21, 
             stroke = 2, color = myfills[1]) + 
  geom_point(aes(fill = u_fill, x = ub), 
             size = 5, 
             pch = 21, 
             stroke = 2, 
             color = myfills[1]) +
  geom_label(aes(label = paste(Interval, meaning), x = 3), 
             vjust = -.75, 
             label.padding = unit(0,"lines"), 
             label.size = 0, 
             family = bfont, 
             size = ggtext_size(27)) +
  scale_fill_identity() +
  scale_y_discrete(NULL, expand = expansion(c(0.08, 0.25))) +
  scale_x_continuous(NULL, minor_breaks = NULL) +
  theme_minimal(base_size = 27, 
                base_family = bfont) + 
  theme(axis.text.y = element_blank(), 
        panel.grid.major = element_blank(),
        axis.line.x = element_line(linewidth = .5),
        axis.ticks.x = element_line(linewidth = .25), 
        plot.margin = margin())

8.2 Sample Spaces

The set of all possible outcomes of a random variable is the sample space. Continuing with our example, the sample space of a single throw of a six-sided die is the set $\{1,2,3,4,5,6\}$ . Sample space is a curious term. Why sample and why space? With random variables, populations are infinitely large, at least theoretically. Random variables just keep spitting out numbers forever! So any time we actually observe numbers generated by a random variable, we are always observing a sample; actual infinities cannot be observed in their entirety. A space is a set that has mathematical structure. Most random variables generate either integers or real numbers, both of which are structured in many ways (e.g., order).

A sample space is the set of all possible values that a random variable can assume.A population consists of all entities under consideration.A sample is a subset of a population.

Unlike distributions having to do with dice, many distributions have a sample space with an infinite number of elements. Interestingly, there are two kinds of infinity we can consider. A distribution’s sample space might be the set of whole numbers: $\{0,1,2,...\}$ , which extends to positive infinity. The sample space of all integers extends to infinity in both directions: $\{...-2,-1,0,1,2,...\}$ .

The sample space of continuous variables is infinitely large for another reason. Between any two points in a continuous distribution, there is an infinite number of other points. For example, in the beta distribution, the sample space consists of all real numbers between 0 and 1: $(0,1)$ . Many continuous distributions have sample spaces that involve both kinds of infinity. For example, the sample space of the normal distribution consists of all real numbers from negative infinity to positive infinity: $(-\infty, +\infty)$ .