Abstract.
It has been noted that in many professional sports leagues a good predictor of a team's end of season won-loss percentage is Bill James' Pythagorean Formula $\frac{{\rm RS}_{\rm obs}^{\gamma}}{{\rm RS}_{\rm obs}^{\gamma}+{\rm RA}_{\rm obs}^{\gamma}}$, where ${\rm RS}_{\rm obs}$ (resp. ${\rm RA}_{\rm obs}$) is the observed average number of runs scored (allowed) per game and $\gamma$ is a constant for the league; for baseball the best agreement is when $\gamma$ is about $1.82$. This formula is often used in the middle of a season to determine if a team is performing above or below expectations, and to estimate its future standing.
We provide a theoretical justification for this formula and value of $\gamma$ by modeling the number of runs scored and allowed in baseball games as independent random variables drawn from Weibull distributions with the same $\beta$ and $\gamma$ but different $\alpha$; the probability density is
\[ f(x;\alpha,\beta,\gamma)\ =\ \begin{cases}\frac{\gamma}{\alpha}\,((x-\beta)/\alpha)^{\gamma-1}\,e^{-((x-\beta)/\alpha)^{\gamma}} & \text{if $x\geq\beta$}\\ 0 & \text{otherwise.}\end{cases} \]
This model leads to a predicted won-loss percentage of $\frac{({\rm RS}-\beta)^{\gamma}}{({\rm RS}-\beta)^{\gamma}+({\rm RA}-\beta)^{\gamma}}$; here ${\rm RS}$ (resp. ${\rm RA}$) is the mean of the Weibull random variable corresponding to runs scored (allowed), and ${\rm RS}-\beta$ (resp. ${\rm RA}-\beta$) is an estimator of ${\rm RS}_{\rm obs}$ (resp. ${\rm RA}_{\rm obs}$). An analysis of the 14 American League teams from the 2004 baseball season shows that (1) given that the runs scored and allowed in a game cannot be equal, the runs scored and allowed are statistically independent; (2) the best fit Weibull parameters attained from a least squares analysis and the method of maximum likelihood give good fits.
Specifically, least squares yields a mean value of $\gamma$ of $1.79$ (with a standard deviation of $.09$) and maximum likelihood yields a mean value of $\gamma$ of $1.74$ (with a standard deviation of $.06$), which agree beautifully with the observed best value of $1.82$ attained by fitting $\frac{{\rm RS}_{\rm obs}^{\gamma}}{{\rm RS}_{\rm obs}^{\gamma}+{\rm RA}_{\rm obs}^{\gamma}}$ to the observed winning percentages.
Key words and phrases:
Pythagorean Won-Loss Formula, Weibull Distribution, Hypothesis Testing
2000 Mathematics Subject Classification:
46N30 (primary), 62F03, 62P99 (secondary).
1. Introduction
The goal of this paper is to derive Bill James' Pythagorean Formula (see [Ja], as well as [An, Ol]) from reasonable assumptions about the distribution of scores. Given a sports league, if the observed average number of runs a team scores and allows are ${\rm RS}_{\rm obs}$ and ${\rm RA}_{\rm obs}$, then the Pythagorean Formula predicts the team's won-loss percentage should be $\frac{{\rm RS}_{\rm obs}^{\gamma}}{{\rm RS}_{\rm obs}^{\gamma}+{\rm RA}_{\rm obs}^{\gamma}}$ for some $\gamma$ which is constant for the league. Initially in baseball the exponent $\gamma$ was taken to be $2$ (which led to the name), though fitting $\gamma$ to the observed records from many seasons led to the best $\gamma$ being about $1.82$. Often this formula is applied part way through a season to estimate a team's end of season standings. For example, if halfway through a season a team has far more wins than this formula predicts, analysts often claim the team is playing over its head and predict it will have a worse second half.
Rather than trying to find the best $\gamma$ by looking at many teams' won-loss percentages, we take a different approach and derive the formula and optimal value of $\gamma$ by modeling the runs scored and allowed each game for a team as independent random variables drawn from Weibull distributions with the same $\beta$ and $\gamma$ but different $\alpha$ (see §3 for an analysis of the 2004 season which shows that, subject to the condition that the runs scored and allowed in a game must be distinct integers, the runs scored and allowed are statistically independent, and §4 for additional comments on the independence). Recall the three-parameter Weibull distribution (see also [Fe2]) is
\[ f(x;\alpha,\beta,\gamma)\ =\ \begin{cases}\frac{\gamma}{\alpha}\left(\frac{x-\beta}{\alpha}\right)^{\gamma-1}e^{-((x-\beta)/\alpha)^{\gamma}} & \text{if $x\geq\beta$}\\ 0 & \text{otherwise.}\end{cases} \tag{1.1} \]
We denote the means by ${\rm RS}$ and ${\rm RA}$, and we show below that ${\rm RS}-\beta$ (resp. ${\rm RA}-\beta$) is an estimator of the observed average number of runs scored (resp. allowed) per game. The reason ${\rm RS}-\beta$ and not ${\rm RS}$ is the estimator of the observed average runs scored per game is due to the discreteness of the runs scored data; this is described in greater detail below. Our main theoretical result is proving that this model leads to a predicted won-loss percentage of
\[ \mbox{Won-Loss Percentage}({\rm RS},{\rm RA},\beta,\gamma)\ =\ \frac{({\rm RS}-\beta)^{\gamma}}{({\rm RS}-\beta)^{\gamma}+({\rm RA}-\beta)^{\gamma}}; \tag{1.2} \]
note for all $\gamma$ that if ${\rm RS}={\rm RA}$ in (1.2) then, as we would expect, the won-loss percentage is $50\%$.
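As a quick illustration of (1.2), the following minimal Python sketch evaluates the predicted won-loss percentage. The function name `pythagorean_win_pct` and the sample values (means of 5.5 runs scored and 4.5 runs allowed per game) are hypothetical choices made for this example and are not data from the paper; with `beta = 0` the expression reduces to James' formula applied directly to the averages.

```python
def pythagorean_win_pct(rs: float, ra: float, gamma: float = 1.82, beta: float = 0.0) -> float:
    """Predicted won-loss percentage from (1.2).

    rs, ra : means of the runs scored / allowed distributions
    gamma  : league-wide exponent (1.82 is the empirical baseball value quoted in the text)
    beta   : translation parameter; rs - beta and ra - beta must be positive
    """
    scored = (rs - beta) ** gamma
    allowed = (ra - beta) ** gamma
    return scored / (scored + allowed)


# Equal means give an even record for any exponent.
print(pythagorean_win_pct(5.0, 5.0, gamma=2.0))
print(pythagorean_win_pct(5.0, 5.0, gamma=1.82))

# Hypothetical team that outscores its opponents on average.
print(pythagorean_win_pct(5.5, 4.5))
```

With equal inputs the function returns one half for every exponent, matching the remark above; a team that outscores its opponents is predicted to win more than half its games.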
In §3 we analyze in great detail the 2004 baseball season for the 14 teams of the American League. Complete results of each game are readily available (see for example [Al]), which greatly facilitates curve fitting and error analysis. For each of these teams we used the method of least squares and the method of maximum likelihood to find the best fit Weibulls to the runs scored and allowed per game (with each having the same $\gamma$ and both having $\beta=-.5$; we explain why this is the right choice for $\beta$ below). Standard $\chi^{2}$ tests (see for example [CaBe]) show our fits are adequate. For continuous random variables representing runs scored and runs allowed, there is zero probability of both having the same value; the situation is markedly different in the discrete case. In a baseball game runs scored and allowed cannot be entirely independent, as games do not end in ties; however, modulo this condition, modified $\chi^{2}$ tests (see [BF, SD]) do show that, given that runs scored and allowed per game must be distinct integers, the runs scored and allowed per game are statistically independent. See [Ci] for more on the independence of runs scored and allowed.
Thus the assumptions of our theoretical model are met, and the Pythagorean Formula should hold for some exponent $\gamma$. Our main experimental result is that, averaging over the 14 teams, the method of least squares yields a mean of $\gamma$ of $1.79$ with a standard deviation of $.09$ (the median is $1.79$ as well); the method of maximum likelihood yields a mean of $\gamma$ of $1.74$ with a standard deviation of $.06$ (the median is $1.76$). This is in line with the numerical observation that $\gamma=1.82$ is the best exponent.
In order to obtain simple closed form expressions for the probability of scoring more runs than allowing in a game, we assume that the runs scored and allowed are drawn from continuous and not discrete distributions. This allows us to replace discrete sums with continuous integrals, and in general integration leads to more tractable calculations than summations. Of course assumptions of continuous run distribution cannot be correct in baseball, but the hope is that such a computationally useful assumption is a reasonable approximation to reality; it may be more reasonable in a sport such as basketball, and this would make an additional, interesting project. Closed form expressions for the mean, variance and probability that one random variable exceeds another are difficult for general probability distributions; however, the integrations that arise from a Weibull distribution with parameters $(\alpha,\beta,\gamma)$ are very tractable. Further, as the three parameter Weibull is a very flexible family and takes on a variety of different shapes, it is not surprising that for an appropriate choice of parameters it is a good fit to the runs scored (or allowed) per game. What is fortunate is that we can get good fits to both runs scored and allowed simultaneously, using the same $\gamma$ for each; see [BFAM] for additional problems modeled with Weibull distributions. For example, $\gamma=1$ is the exponential and $\gamma=2$ is the Rayleigh distribution. Note the great difference in behavior between these two distributions. The exponential's maximum probability is at $x=\beta$, whereas the Rayleigh is zero at $x=\beta$. Additionally, for any $M>\beta$ any Weibull has a non-zero probability of a team scoring (or allowing) more than $M$ runs, which is absurd of course in the real world.
The tail probabilities of the exponential are significantly greater than those of the Rayleigh, which indicates that perhaps something closer to the Rayleigh than the exponential is the truth for the distribution of runs.
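The contrast between these two special cases can be seen numerically. The sketch below is only an illustration: it evaluates the density (1.1) at $x=\beta$ and the tail probability ${\rm Prob}(X>M)=e^{-((M-\beta)/\alpha)^{\gamma}}$, which follows from integrating (1.1) from $M$ to $\infty$. The values `alpha = 5`, `beta = 0` and `M = 15` are assumptions picked for the example, and the two distributions are compared at the same $\alpha$ rather than at the same mean.

```python
import math


def weibull_pdf(x, alpha, beta, gamma):
    """Three-parameter Weibull density (1.1); this sketch assumes gamma >= 1."""
    if x < beta:
        return 0.0
    z = (x - beta) / alpha
    return (gamma / alpha) * z ** (gamma - 1) * math.exp(-(z ** gamma))


def weibull_tail(m, alpha, beta, gamma):
    """Prob(X > m) for m >= beta, obtained by integrating the density from m to infinity."""
    return math.exp(-(((m - beta) / alpha) ** gamma))


alpha, beta, M = 5.0, 0.0, 15.0  # illustrative values, not fitted parameters

for name, gamma in (("exponential", 1.0), ("Rayleigh", 2.0)):
    print(name,
          "density at beta:", weibull_pdf(beta, alpha, beta, gamma),
          "Prob(X > M):", weibull_tail(M, alpha, beta, gamma))
```

For these inputs the exponential density is largest at $x=\beta$ while the Rayleigh density vanishes there, and the exponential tail beyond $M$ is larger by several orders of magnitude.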
We have incorporated a translation parameter $\beta$ for several reasons. First, it facilitates applying this model to sports other than baseball. For example, in basketball no team scores fewer than 20 points in a game, and it is not unreasonable to look at the distribution of scores above a baseline. A second consequence of $\beta$ is that adding $P$ points to both the runs scored and runs allowed each game does not change the won-loss percentage; this is reflected beautifully in (1.2), and indicates that it is more natural to measure scores above a baseline (which may be zero). Finally, and most importantly, as remarked there are issues in the discreteness of the data and the continuity of the model. In the least squares and maximum likelihood curve fitting we bin the runs scored and allowed data into bins of length $1$; for example, a natural choice of bins is
\[ [0,1)\ \cup\ [1,2)\ \cup\ \cdots\ \cup\ [9,10)\ \cup\ [10,12)\ \cup\ [12,\infty). \tag{1.3} \]
As baseball scores are non-negative integers, all of the mass in each bin is at the left endpoint. If we use untranslated Weibulls (i.e., $\beta=0$) there would be a discrepancy in matching up the means.
For example, consider a simple case when in half the games the team scores 0 runs and in the other half they score 1. Let us take as our bins $[0,1)$ and $[1,2)$, and for ease of exposition we shall find the best fit function constant on each bin. Obviously we take our function to be identically $\frac{1}{2}$ on $[0,2)$; however, the observed mean is $\frac{1}{2}\cdot 0+\frac{1}{2}\cdot 1=\frac{1}{2}$ whereas the mean of our piecewise constant approximant is $1$. If instead we chose $[-.5,.5)$ and $[.5,1.5)$ as our bins then the approximant would also have a mean of $\frac{1}{2}$. Returning to our model, we see a better choice of bins is
\[ [-.5,.5]\ \cup\ [.5,1.5]\ \cup\ \cdots\ \cup\ [7.5,8.5]\ \cup\ [8.5,9.5]\ \cup\ [9.5,11.5]\ \cup\ [11.5,\infty). \tag{1.4} \]
An additional advantage of the bins of (1.4) is that we may consider either open or closed endpoints, as there are no baseball scores that are half-integral. Thus, in order to have the baseball scores in the center of their bins, we take $\beta=-.5$ and use the bins in (1.4). In particular, if the mean of the Weibull approximating the runs scored (resp. allowed) per game is ${\rm RS}$ (resp. ${\rm RA}$) then ${\rm RS}-\beta$ (resp. ${\rm RA}-\beta$) is an estimator of the observed average number of runs scored (resp. allowed) per game.
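The two-score example above is easy to reproduce. The following sketch (with a hypothetical helper `histogram_mean`) computes the mean of a piecewise constant density from its bins and bin masses, and compares the bins $[0,1),[1,2)$ with the centered bins $[-.5,.5),[.5,1.5)$.

```python
def histogram_mean(bins, masses):
    """Mean of a density that is constant on each bin (lo, hi) and carries the given mass there."""
    return sum(mass * (lo + hi) / 2 for (lo, hi), mass in zip(bins, masses))


# Toy data from the text: half the games end with 0 runs, half with 1 run.
scores = [0, 1]
masses = [0.5, 0.5]
observed_mean = sum(s * m for s, m in zip(scores, masses))

left_endpoint_bins = [(0.0, 1.0), (1.0, 2.0)]   # scores sit at the left endpoints
centered_bins = [(-0.5, 0.5), (0.5, 1.5)]       # scores sit at the bin centers

print(observed_mean)                               # 0.5
print(histogram_mean(left_endpoint_bins, masses))  # 1.0
print(histogram_mean(centered_bins, masses))       # 0.5
```

Centering the integer scores in their bins removes the half-run offset between the observed mean and the mean of the continuous approximant.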
2. Theoretical Model and Predictions
We determine the mean of a Weibull distribution with parameters $(\alpha,\beta,\gamma)$, and then use this to prove our main result, the Pythagorean Formula (Theorem 2.2). Let $f(x;\alpha,\beta,\gamma)$ be the probability density of a Weibull with parameters $(\alpha,\beta,\gamma)$:
\[ f(x;\alpha,\beta,\gamma)\ =\ \begin{cases}\frac{\gamma}{\alpha}\left(\frac{x-\beta}{\alpha}\right)^{\gamma-1}e^{-((x-\beta)/\alpha)^{\gamma}} & \text{if $x\geq\beta$}\\ 0 & \text{otherwise.}\end{cases} \tag{2.1} \]
For $s\in\mathbb{C}$ with the real part of $s$ greater than $0$, recall the $\Gamma$-function (see [Fe1]) is defined by
\[ \Gamma(s)\ =\ \int_{0}^{\infty}e^{-u}u^{s-1}\,{\rm d}u\ =\ \int_{0}^{\infty}e^{-u}u^{s}\,\frac{{\rm d}u}{u}. \tag{2.2} \]
Letting $\mu_{\alpha,\beta,\gamma}$ denote the mean of $f(x;\alpha,\beta,\gamma)$, we have
\begin{align}
\mu_{\alpha,\beta,\gamma} &\ =\ \int_{\beta}^{\infty}x\cdot\frac{\gamma}{\alpha}\left(\frac{x-\beta}{\alpha}\right)^{\gamma-1}e^{-((x-\beta)/\alpha)^{\gamma}}\,{\rm d}x \tag{2.3} \\
&\ =\ \int_{\beta}^{\infty}\alpha\,\frac{x-\beta}{\alpha}\cdot\frac{\gamma}{\alpha}\left(\frac{x-\beta}{\alpha}\right)^{\gamma-1}e^{-((x-\beta)/\alpha)^{\gamma}}\,{\rm d}x\ +\ \beta. \notag
\end{align}
We change variables by setting $u=\left(\frac{x-\beta}{\alpha}\right)^{\gamma}$. Then ${\rm d}u=\frac{\gamma}{\alpha}\left(\frac{x-\beta}{\alpha}\right)^{\gamma-1}{\rm d}x$ and we have
\begin{align}
\mu_{\alpha,\beta,\gamma} &\ =\ \int_{0}^{\infty}\alpha u^{\gamma^{-1}}\cdot e^{-u}\,{\rm d}u\ +\ \beta \tag{2.4} \\
&\ =\ \alpha\int_{0}^{\infty}e^{-u}u^{1+\gamma^{-1}}\,\frac{{\rm d}u}{u}\ +\ \beta \notag \\
&\ =\ \alpha\Gamma(1+\gamma^{-1})\ +\ \beta. \notag
\end{align}
A similar calculation determines the variance. We record these results:
Lemma 2.1.
The mean $\mu_{\alpha,\beta,\gamma}$ and variance $\sigma^{2}_{\alpha,\beta,\gamma}$ of a Weibull with parameters $(\alpha,\beta,\gamma)$ are
\begin{align}
\mu_{\alpha,\beta,\gamma} &\ =\ \alpha\Gamma(1+\gamma^{-1})+\beta \notag \\
\sigma^{2}_{\alpha,\beta,\gamma} &\ =\ \alpha^{2}\Gamma\left(1+2\gamma^{-1}\right)-\alpha^{2}\Gamma\left(1+\gamma^{-1}\right)^{2}. \tag{2.5}
\end{align}
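The closed forms in Lemma 2.1 can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, the parameters $\alpha=5$, $\beta=-.5$, $\gamma=1.8$ are assumed for the example rather than taken from the fitted values, and the check is a plain midpoint-rule quadrature of $\int x f(x;\alpha,\beta,\gamma)\,{\rm d}x$ truncated far out in the tail (which assumes $\gamma$ is not much smaller than $1$).

```python
import math


def weibull_mean(alpha, beta, gamma):
    """Mean from Lemma 2.1: alpha * Gamma(1 + 1/gamma) + beta."""
    return alpha * math.gamma(1 + 1 / gamma) + beta


def weibull_variance(alpha, beta, gamma):
    """Variance from Lemma 2.1; it does not depend on the translation beta."""
    g1 = math.gamma(1 + 1 / gamma)
    g2 = math.gamma(1 + 2 / gamma)
    return alpha ** 2 * (g2 - g1 ** 2)


def numeric_mean(alpha, beta, gamma, steps=200_000):
    """Midpoint-rule approximation of the mean, truncating the integral at beta + 40*alpha."""
    upper = beta + 40 * alpha
    h = (upper - beta) / steps
    total = 0.0
    for i in range(steps):
        x = beta + (i + 0.5) * h
        z = (x - beta) / alpha
        total += x * (gamma / alpha) * z ** (gamma - 1) * math.exp(-(z ** gamma)) * h
    return total


alpha, beta, gamma = 5.0, -0.5, 1.8  # illustrative parameters
print(weibull_mean(alpha, beta, gamma), numeric_mean(alpha, beta, gamma))
print(weibull_variance(alpha, beta, gamma))
```

Two familiar special cases give a further sanity check: for $\gamma=1$ (exponential) the formulas reduce to mean $\alpha+\beta$ and variance $\alpha^{2}$, and for $\gamma=2$ (Rayleigh) the mean is $\alpha\sqrt{\pi}/2+\beta$.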
We can now prove our main result:
Theorem 2.2 (Pythagorean Won-Loss Formula).
Let the runs scored and runs allowed per game be two independent random variables drawn from Weibull distributions with parameters $(\alpha_{\rm RS},\beta,\gamma)$ and $(\alpha_{\rm RA},\beta,\gamma)$ respectively, where $\alpha_{\rm RS}$ and $\alpha_{\rm RA}$ are chosen so that the means are ${\rm RS}$ and ${\rm RA}$. If $\gamma>0$ then
\[ \mbox{Won-Loss Percentage}({\rm RS},{\rm RA},\beta,\gamma)\ =\ \frac{({\rm RS}-\beta)^{\gamma}}{({\rm RS}-\beta)^{\gamma}+({\rm RA}-\beta)^{\gamma}}. \tag{2.6} \]
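Before turning to the proof, the statement can be sanity-checked numerically. The sketch below is a check under stated assumptions, not part of the argument: it picks hypothetical means ${\rm RS}=5.0$ and ${\rm RA}=4.5$ with $\beta=-.5$ and $\gamma=1.8$, solves the mean formula of Lemma 2.1 for the scale parameters, and approximates ${\rm Prob}(X>Y)=\int f_X(x)\,F_Y(x)\,{\rm d}x$ by the midpoint rule, using the standard Weibull distribution function $F_Y(x)=1-e^{-((x-\beta)/\alpha_{\rm RA})^{\gamma}}$. All function names are illustrative.

```python
import math


def alpha_for_mean(mean, beta, gamma):
    """Scale parameter giving the requested mean, from mean = alpha*Gamma(1 + 1/gamma) + beta."""
    return (mean - beta) / math.gamma(1 + 1 / gamma)


def pythagorean(rs, ra, beta, gamma):
    """Right-hand side of (2.6)."""
    s, a = (rs - beta) ** gamma, (ra - beta) ** gamma
    return s / (s + a)


def prob_x_exceeds_y(rs, ra, beta, gamma, steps=200_000):
    """Midpoint-rule approximation of Prob(X > Y) for independent Weibulls with shared beta, gamma."""
    a_rs = alpha_for_mean(rs, beta, gamma)
    a_ra = alpha_for_mean(ra, beta, gamma)
    upper = beta + 40 * max(a_rs, a_ra)  # truncation point; the tail beyond it is negligible here
    h = (upper - beta) / steps
    total = 0.0
    for i in range(steps):
        x = beta + (i + 0.5) * h
        zx = (x - beta) / a_rs
        pdf_x = (gamma / a_rs) * zx ** (gamma - 1) * math.exp(-(zx ** gamma))
        cdf_y = 1.0 - math.exp(-(((x - beta) / a_ra) ** gamma))
        total += pdf_x * cdf_y * h
    return total


rs, ra, beta, gamma = 5.0, 4.5, -0.5, 1.8  # hypothetical inputs
print(prob_x_exceeds_y(rs, ra, beta, gamma))
print(pythagorean(rs, ra, beta, gamma))
```

The two printed values should agree to within the quadrature error, which is what Theorem 2.2 asserts for every $\gamma>0$.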
Proof.
Let $X$ and $Y$ be independent random variables with Weibull distributions $(\alpha_{\rm RS},\beta,\gamma)$ and $(\alpha_{\rm RA},\beta,\gamma)$ respectively, where $X$ is the number of runs scored and $Y$ the number of runs allowed per game. As the means are ${\rm RS}$ and ${\rm RA}$, by Lemma 2.1 we have
\begin{align}
{\rm RS} &\ =\ \alpha_{\rm RS}\Gamma(1+\gamma^{-1})+\beta \notag \\
{\rm RA} &\ =\ \alpha_{\rm RA}\Gamma(1+\gamma^{-1})+\beta. \tag{2.7}
\end{align}
Equivalently, we have
\begin{align}
\alpha_{\rm RS} &\ =\ \frac{{\rm RS}-\beta}{\Gamma(1+\gamma^{-1})} \notag \\
\alpha_{\rm RA} &\ =\ \frac{{\rm RA}-\beta}{\Gamma(1+\gamma^{-1})}. \tag{2.8}
\end{align}
We need only calculate the probability that $X$ exceeds $Y$. Below we constantly use the fact that the integral of a probability density is $1$. We have
\begin{align*}
{\rm Prob}(X>Y) &\ =\ \int_{x=\beta}^{\infty}\int_{y=\beta}^{x}f(x;\alpha_{\rm RS},\beta,\gamma)\,f(y;\alpha_{\rm RA},\beta,\gamma)\,{\rm d}y\,{\rm d}x \\
&\ =\ \int_{x=\beta}^{\infty}\int_{y=\beta}^{x}\frac{\gamma}{\alpha_{\rm RS}}\left(\frac{x-\beta}{\alpha_{\rm RS}}\right)^{\gamma-1}e^{-((x-\beta)/\alpha_{\rm RS})^{\gamma}}\,\frac{\gamma}{\alpha_{\rm RA}}\left(\frac{y-\beta}{\alpha_{\rm RA}}\right)^{\gamma-1}e^{-((y-\beta)/\alpha_{\rm RA})^{\gamma}}\,{\rm d}y\,{\rm d}x
\end{align*}