June 5, 2021
To understand the concept behind \(E_X\), consider a discrete random variable with range \(R_X = \{x_1, x_2, x_3, \ldots\}\). This random variable results from a random experiment. Suppose that we repeat this experiment a very large number of times \(N\), and that the trials are independent. Let \(N_1\) be the number of times we observe \(x_1\), \(N_2\) the number of times we observe \(x_2\), and, in general, \(N_k\) the number of times we observe \(x_k\). Since \(P(X=x_k) = P_X(x_k)\), by the relative-frequency interpretation of probability we expect that
$$P_X(x_1) \approx \frac{N_1}{N}$$
$$P_X(x_2) \approx \frac{N_2}{N}$$
$$\vdots$$
$$P_X(x_k) \approx \frac{N_k}{N}$$
In other words, we have \(N_k \approx N \cdot P_X(x_k)\). Now, if we take the average of the observed values of \(X\), we obtain
$$\text{Average} = \frac{N_1x_1 + N_2x_2 + N_3x_3 + \ldots}{N}$$
$$\approx \frac{N \cdot P_X(x_1) \cdot x_1 + N \cdot P_X(x_2) \cdot x_2 + \ldots}{N}$$
$$= P_X(x_1) \cdot x_1 + P_X(x_2) \cdot x_2 + \ldots$$
$$= \sum_{i} x_i \cdot P_X(x_i) = E_X$$
Thus, the intuition behind \(E_X\) is that if you repeat the random experiment independently \(N\) times and take the average of the observed data, the average gets closer and closer to \(E_X\) as \(N\) gets larger and larger.
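To see this numerically, here is a minimal Python sketch (the range, PMF, and sample sizes below are made-up values for illustration): it draws \(N\) independent samples from a small discrete distribution and compares the sample average with \(E_X\) as \(N\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up discrete random variable: range R_X and PMF P_X.
x = np.array([1.0, 2.0, 5.0])   # x_1, x_2, x_3
p = np.array([0.5, 0.3, 0.2])   # P_X(x_1), P_X(x_2), P_X(x_3)

e_x = np.sum(x * p)             # E_X = sum_i x_i * P_X(x_i) = 2.1

# Repeat the experiment N times and compare the sample average with E_X.
for n in (100, 10_000, 1_000_000):
    samples = rng.choice(x, size=n, p=p)
    print(f"N = {n:>9,}: average = {samples.mean():.4f} (E_X = {e_x:.1f})")
```

As \(N\) increases, the printed averages should settle ever closer to 2.1.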
June 6, 2021
However, when we model real-life stochastic processes, we often do not know the model parameters \(\theta\); we simply observe the outcomes \(O\), and our goal is then to estimate \(\theta\) from \(O\).
We know that, given a value of \(\theta\), the probability of observing \(O\) is \(P(O|\theta)\). Thus, a 'natural' estimation procedure is to choose the value of \(\theta\) that maximizes the probability of actually observing \(O\).
Find the parameter values \(\theta\) that maximize the following function:
$$L(\theta|O) = P(O|\theta)$$
\(L(\theta|O)\) is called the likelihood function. Notice that by definition, the likelihood function is conditioned on the observed \(O\) and that it is a function of the unknown parameters \(\theta\).
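As a concrete discrete sketch (the coin-flip data and the Bernoulli model below are assumptions for illustration): if \(O\) consists of \(n\) independent coin flips with \(k\) heads, then \(L(\theta|O) = \theta^k (1-\theta)^{n-k}\), and maximizing it numerically recovers the familiar estimate \(\hat{\theta} = k/n\).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up observed outcomes O: 10 coin flips, 1 = heads.
O = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
n, k = len(O), O.sum()

def neg_log_likelihood(theta):
    # -log L(theta|O) for a Bernoulli model, L = theta^k * (1 - theta)^(n - k).
    return -(k * np.log(theta) + (n - k) * np.log(1 - theta))

# Maximizing L is the same as minimizing -log L; search over (0, 1).
result = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1 - 1e-9),
                         method="bounded")
print(f"theta_hat = {result.x:.4f} (closed form k/n = {k / n})")
```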
Denote the probability density function (pdf) associated with the outcomes \(O\) as \(f(O|\theta)\). Thus, in the continuous case, we estimate \(\theta\) given the observed outcomes \(O\) by maximizing the following function:
$$L(\theta|O) = f(O|\theta)$$
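A continuous counterpart, again as a sketch on simulated data (the exponential model and the true rate below are assumptions): for i.i.d. observations with pdf \(f(o|\theta) = \theta e^{-\theta o}\), the log-likelihood is \(\sum_i (\log\theta - \theta o_i)\), whose maximizer is \(\hat{\theta} = 1/\bar{o}\); the numeric search should reproduce it.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

# Simulated continuous outcomes O: exponential data with a made-up true rate.
true_rate = 2.0
O = rng.exponential(scale=1 / true_rate, size=5_000)

def neg_log_likelihood(theta):
    # -log L(theta|O) = -sum_i log f(o_i|theta), f(o|theta) = theta * exp(-theta * o).
    return -(len(O) * np.log(theta) - theta * O.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0),
                         method="bounded")
print(f"theta_hat = {result.x:.4f} (closed form 1/mean = {1 / O.mean():.4f})")
```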