Patsy:類別變數的對比編碼系統¶
注意
本文檔基於UCLA 提供的這份優秀資源。
一個具有 K 個類別或層級的類別變數,通常會以 K-1 個虛擬變數的形式進入迴歸模型。這相當於對層級均值進行線性假設。也就是說,這些變數的每個檢定統計量都相當於檢定該層級的均值是否在統計上顯著不同於基準類別的均值。這種虛擬編碼在 R 語言中稱為「處理編碼 (Treatment coding)」,我們將遵循此慣例。然而,還有不同的編碼方法,相當於不同的線性假設集合。
事實上,虛擬編碼在技術上並非對比編碼。這是因為虛擬變數加總為 1,並且在功能上不獨立於模型的截距。另一方面,對於一個具有 k 個層級的類別變數,一組對比 (contrasts) 是一組 k-1 個功能上獨立的因子層級均值線性組合,並且獨立於虛擬變數的總和。虛擬編碼本身並沒有錯。它可以捕捉所有的係數,但當模型假設係數之間相互獨立時,例如在變異數分析 (ANOVA) 中,會使問題複雜化。線性迴歸模型不假設係數之間相互獨立,因此虛擬編碼通常是此類情況下唯一教導的編碼方式。
為了了解 Patsy 中的對比矩陣,我們將使用來自 UCLA ATS 的資料。首先,讓我們載入資料。
範例資料¶
In [1]: import pandas
In [2]: url = 'https://stats.idre.ucla.edu/stat/data/hsb2.csv'
In [3]: hsb2 = pandas.read_csv(url)
觀察因變數 write 在每個種族層級的均值將會很有幫助 ((1 = 西班牙裔, 2 = 亞洲裔, 3 = 非洲裔和 4 = 高加索裔))。
In [4]: hsb2.groupby('race')['write'].mean()
Out[4]:
race
1 46.458333
2 58.000000
3 48.200000
4 54.055172
Name: write, dtype: float64
處理(虛擬)編碼¶
虛擬編碼可能是最廣為人知的編碼方案。它將類別變數的每個層級與基準參考層級進行比較。基準參考層級是截距的值。它是 Patsy 中無序類別因子的預設對比。種族的處理對比矩陣將會是
In [5]: from patsy.contrasts import Treatment
In [6]: levels = [1,2,3,4]
In [7]: contrast = Treatment(reference=0).code_without_intercept(levels)
In [8]: print(contrast.matrix)
[[0. 0. 0.]
[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
這裡我們使用了 reference=0,這表示第一個層級,西班牙裔,是測量其他層級效果的基準類別。如上所述,各列的總和不為零,因此不獨立於截距。為了清楚起見,讓我們看看這將如何編碼 race 變數。
In [9]: contrast.matrix[hsb2.race-1, :][:20]
Out[9]:
array([[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 1., 0.],
[0., 0., 0.],
[0., 0., 1.],
[0., 1., 0.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 1., 0.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.]])
這有點技巧,因為 race 類別方便地映射到從零開始的索引。如果不是這樣,則會在後台進行此轉換,因此這並非普遍適用,但對於理解概念仍是一個有用的練習。以下說明使用上述三種對比的輸出
In [10]: from statsmodels.formula.api import ols
In [11]: mod = ols("write ~ C(race, Treatment)", data=hsb2)
In [12]: res = mod.fit()
In [13]: print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.107
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 7.833
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.78e-05
Time: 16:08:41 Log-Likelihood: -721.77
No. Observations: 200 AIC: 1452.
Df Residuals: 196 BIC: 1465.
Df Model: 3
Covariance Type: nonrobust
===========================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept 46.4583 1.842 25.218 0.000 42.825 50.091
C(race, Treatment)[T.2] 11.5417 3.286 3.512 0.001 5.061 18.022
C(race, Treatment)[T.3] 1.7417 2.732 0.637 0.525 -3.647 7.131
C(race, Treatment)[T.4] 7.5968 1.989 3.820 0.000 3.675 11.519
==============================================================================
Omnibus: 10.487 Durbin-Watson: 1.779
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.031
Skew: -0.551 Prob(JB): 0.00402
Kurtosis: 2.670 Cond. No. 8.25
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
我們明確地給出了種族的對比;然而,由於處理是預設值,我們可以省略它。
簡單編碼¶
與處理編碼類似,簡單編碼會將每個層級與固定的參考層級進行比較。但是,使用簡單編碼時,截距是所有因子層級的總均值。有關如何實作簡單對比,請參閱使用者定義編碼。
In [14]: contrast = Simple().code_without_intercept(levels)
In [15]: print(contrast.matrix)
[[-0.25 -0.25 -0.25]
[ 0.75 -0.25 -0.25]
[-0.25 0.75 -0.25]
[-0.25 -0.25 0.75]]
In [16]: mod = ols("write ~ C(race, Simple)", data=hsb2)
In [17]: res = mod.fit()
In [18]: print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.107
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 7.833
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.78e-05
Time: 16:08:41 Log-Likelihood: -721.77
No. Observations: 200 AIC: 1452.
Df Residuals: 196 BIC: 1465.
Df Model: 3
Covariance Type: nonrobust
===========================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept 51.6784 0.982 52.619 0.000 49.741 53.615
C(race, Simple)[Simp.1] 11.5417 3.286 3.512 0.001 5.061 18.022
C(race, Simple)[Simp.2] 1.7417 2.732 0.637 0.525 -3.647 7.131
C(race, Simple)[Simp.3] 7.5968 1.989 3.820 0.000 3.675 11.519
==============================================================================
Omnibus: 10.487 Durbin-Watson: 1.779
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.031
Skew: -0.551 Prob(JB): 0.00402
Kurtosis: 2.670 Cond. No. 7.03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
總和(偏差)編碼¶
總和編碼將給定層級的因變數均值與所有層級的因變數總均值進行比較。也就是說,它使用前 k-1 個層級中的每個層級與第 k 層級之間的對比。在此範例中,將層級 1 與所有其他層級進行比較,層級 2 與所有其他層級進行比較,層級 3 與所有其他層級進行比較。
In [19]: from patsy.contrasts import Sum
In [20]: contrast = Sum().code_without_intercept(levels)
In [21]: print(contrast.matrix)
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]
[-1. -1. -1.]]
In [22]: mod = ols("write ~ C(race, Sum)", data=hsb2)
In [23]: res = mod.fit()
In [24]: print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.107
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 7.833
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.78e-05
Time: 16:08:41 Log-Likelihood: -721.77
No. Observations: 200 AIC: 1452.
Df Residuals: 196 BIC: 1465.
Df Model: 3
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 51.6784 0.982 52.619 0.000 49.741 53.615
C(race, Sum)[S.1] -5.2200 1.631 -3.200 0.002 -8.437 -2.003
C(race, Sum)[S.2] 6.3216 2.160 2.926 0.004 2.061 10.582
C(race, Sum)[S.3] -3.4784 1.732 -2.008 0.046 -6.895 -0.062
==============================================================================
Omnibus: 10.487 Durbin-Watson: 1.779
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.031
Skew: -0.551 Prob(JB): 0.00402
Kurtosis: 2.670 Cond. No. 6.72
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
這對應於強制所有係數總和為零的參數化。請注意,此處的截距是總均值,其中總均值是每個層級的因變數均值的均值。
In [25]: hsb2.groupby('race')['write'].mean().mean()
Out[25]: np.float64(51.67837643678162)
後向差分編碼¶
在後向差分編碼中,會將層級的因變數均值與前一個層級的因變數均值進行比較。這種編碼類型對於名義變數或順序變數可能有用。
In [26]: from patsy.contrasts import Diff
In [27]: contrast = Diff().code_without_intercept(levels)
In [28]: print(contrast.matrix)
[[-0.75 -0.5 -0.25]
[ 0.25 -0.5 -0.25]
[ 0.25 0.5 -0.25]
[ 0.25 0.5 0.75]]
In [29]: mod = ols("write ~ C(race, Diff)", data=hsb2)
In [30]: res = mod.fit()
In [31]: print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.107
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 7.833
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.78e-05
Time: 16:08:41 Log-Likelihood: -721.77
No. Observations: 200 AIC: 1452.
Df Residuals: 196 BIC: 1465.
Df Model: 3
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 51.6784 0.982 52.619 0.000 49.741 53.615
C(race, Diff)[D.1] 11.5417 3.286 3.512 0.001 5.061 18.022
C(race, Diff)[D.2] -9.8000 3.388 -2.893 0.004 -16.481 -3.119
C(race, Diff)[D.3] 5.8552 2.153 2.720 0.007 1.610 10.101
==============================================================================
Omnibus: 10.487 Durbin-Watson: 1.779
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.031
Skew: -0.551 Prob(JB): 0.00402
Kurtosis: 2.670 Cond. No. 8.30
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
例如,此處層級 1 的係數是層級 2 的 write 均值與層級 1 的均值進行比較。也就是:
In [32]: res.params["C(race, Diff)[D.1]"]
Out[32]: np.float64(11.541666666666575)
In [33]: hsb2.groupby('race').mean()["write"].loc[2] - \
....: hsb2.groupby('race').mean()["write"].loc[1]
....:
Out[33]: np.float64(11.541666666666664)
Helmert 編碼¶
我們的 Helmert 編碼版本有時稱為反向 Helmert 編碼。層級的因變數均值與先前所有層級的因變數均值進行比較。因此,有時會使用「反向」這個名稱來區分正向 Helmert 編碼。這種比較對於名義變數(例如種族)沒有太多意義,但我們會像這樣使用 Helmert 對比
In [34]: from patsy.contrasts import Helmert
In [35]: contrast = Helmert().code_without_intercept(levels)
In [36]: print(contrast.matrix)
[[-1. -1. -1.]
[ 1. -1. -1.]
[ 0. 2. -1.]
[ 0. 0. 3.]]
In [37]: mod = ols("write ~ C(race, Helmert)", data=hsb2)
In [38]: res = mod.fit()
In [39]: print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.107
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 7.833
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.78e-05
Time: 16:08:41 Log-Likelihood: -721.77
No. Observations: 200 AIC: 1452.
Df Residuals: 196 BIC: 1465.
Df Model: 3
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 51.6784 0.982 52.619 0.000 49.741 53.615
C(race, Helmert)[H.2] 5.7708 1.643 3.512 0.001 2.530 9.011
C(race, Helmert)[H.3] -1.3431 0.867 -1.548 0.123 -3.054 0.368
C(race, Helmert)[H.4] 0.7923 0.372 2.130 0.034 0.059 1.526
==============================================================================
Omnibus: 10.487 Durbin-Watson: 1.779
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.031
Skew: -0.551 Prob(JB): 0.00402
Kurtosis: 2.670 Cond. No. 7.26
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
為了說明,層級 4 的比較是先前三個層級的因變數均值減去層級 4 的均值
In [40]: grouped = hsb2.groupby('race')
In [41]: grouped.mean()["write"].loc[4] - grouped.mean()["write"].loc[:3].mean()
Out[41]: np.float64(3.169061302681982)
如你所見,它們只差一個常數。其他版本的 Helmert 對比會給出均值的實際差異。無論如何,假設檢定是相同的。
In [42]: k = 4
In [43]: 1./k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[:k-1].mean())
Out[43]: np.float64(0.7922653256704955)
In [44]: k = 3
In [45]: 1./k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[:k-1].mean())
Out[45]: np.float64(-1.3430555555555561)
正交多項式編碼¶
對於 k=4 個層級的多項式編碼採用的係數是類別變數的線性、二次和三次趨勢。此處假設類別變數由一個等間距的基礎數值變數表示。因此,這種編碼類型僅用於等間距的有序類別變數。一般而言,多項式對比產生 k-1 階多項式。由於 race 不是有序因子變數,讓我們以 read 為例。首先,我們需要從 read 建立一個有序類別。
In [46]: _, bins = np.histogram(hsb2.read, 3)
In [47]: try: # requires numpy main
....: readcat = np.digitize(hsb2.read, bins, True)
....: except:
....: readcat = np.digitize(hsb2.read, bins)
....:
In [48]: hsb2['readcat'] = readcat
In [49]: hsb2.groupby('readcat').mean()['write']
Out[49]:
readcat
0 46.000000
1 44.980392
2 53.356436
3 60.127660
Name: write, dtype: float64
In [50]: from patsy.contrasts import Poly
In [51]: levels = hsb2.readcat.unique().tolist()
In [52]: contrast = Poly().code_without_intercept(levels)
In [53]: print(contrast.matrix)
[[-0.6708 0.5 -0.2236]
[-0.2236 -0.5 0.6708]
[ 0.2236 -0.5 -0.6708]
[ 0.6708 0.5 0.2236]]
In [54]: mod = ols("write ~ C(readcat, Poly)", data=hsb2)
In [55]: res = mod.fit()
In [56]: print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.320
Model: OLS Adj. R-squared: 0.309
Method: Least Squares F-statistic: 30.73
Date: Thu, 03 Oct 2024 Prob (F-statistic): 2.51e-16
Time: 16:08:41 Log-Likelihood: -694.54
No. Observations: 200 AIC: 1397.
Df Residuals: 196 BIC: 1410.
Df Model: 3
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
Intercept 51.1161 2.018 25.324 0.000 47.135 55.097
C(readcat, Poly).Linear 11.3501 5.348 2.122 0.035 0.803 21.897
C(readcat, Poly).Quadratic 3.8954 4.037 0.965 0.336 -4.066 11.857
C(readcat, Poly).Cubic -2.4598 1.998 -1.231 0.220 -6.400 1.480
==============================================================================
Omnibus: 9.741 Durbin-Watson: 1.699
Prob(Omnibus): 0.008 Jarque-Bera (JB): 10.263
Skew: -0.535 Prob(JB): 0.00591
Kurtosis: 2.703 Cond. No. 13.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
如你所見,readcat 對因變數 write 有顯著的線性效應,但沒有顯著的二次或三次效應。
使用者定義編碼¶
如果你想使用自己的編碼,則必須通過編寫一個包含 code_with_intercept 和 code_without_intercept 方法的編碼類別來實現,這些方法會傳回 patsy.contrast.ContrastMatrix 的實例。
In [57]: from patsy.contrasts import ContrastMatrix
....:
....: def _name_levels(prefix, levels):
....: return ["[%s%s]" % (prefix, level) for level in levels]
....:
In [58]: class Simple(object):
....: def _simple_contrast(self, levels):
....: nlevels = len(levels)
....: contr = -1./nlevels * np.ones((nlevels, nlevels-1))
....: contr[1:][np.diag_indices(nlevels-1)] = (nlevels-1.)/nlevels
....: return contr
....:
....: def code_with_intercept(self, levels):
....: contrast = np.column_stack((np.ones(len(levels)),
....: self._simple_contrast(levels)))
....: return ContrastMatrix(contrast, _name_levels("Simp.", levels))
....:
....: def code_without_intercept(self, levels):
....: contrast = self._simple_contrast(levels)
....: return ContrastMatrix(contrast, _name_levels("Simp.", levels[:-1]))
....:
File <tokenize>:13
def code_without_intercept(self, levels):
^
IndentationError: unindent does not match any outer indentation level
In [60]: mod = ols("write ~ C(race, Simple)", data=hsb2)
....: res = mod.fit()
....: print(res.summary())
....:
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.107
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 7.833
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.78e-05
Time: 16:08:41 Log-Likelihood: -721.77
No. Observations: 200 AIC: 1452.
Df Residuals: 196 BIC: 1465.
Df Model: 3
Covariance Type: nonrobust
===========================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept 51.6784 0.982 52.619 0.000 49.741 53.615
C(race, Simple)[Simp.1] 11.5417 3.286 3.512 0.001 5.061 18.022
C(race, Simple)[Simp.2] 1.7417 2.732 0.637 0.525 -3.647 7.131
C(race, Simple)[Simp.3] 7.5968 1.989 3.820 0.000 3.675 11.519
==============================================================================
Omnibus: 10.487 Durbin-Watson: 1.779
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.031
Skew: -0.551 Prob(JB): 0.00402
Kurtosis: 2.670 Cond. No. 7.03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.