Contrasts Overview¶
[1]:
import numpy as np
import statsmodels.api as sm
This document is based heavily on this excellent resource from UCLA http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, different coding methods that amount to different sets of linear hypotheses.
In fact, dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of contrasts for a categorical variable with k levels is a set of k-1 functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. Dummy coding is not wrong per se. It captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients, such as in ANOVA. Linear regression models do not assume independence of the coefficients, and thus dummy coding is often the only coding that is taught in this context.
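To see that dependence concretely, here is a minimal numeric sketch (ours, reusing the numpy import above): stacking an intercept column with a full set of K dummy columns yields a rank-deficient design matrix, because the dummy columns sum exactly to the intercept column.
dummies = np.eye(4)  # full dummy coding for a 4-level factor, one indicator column per level
X = np.column_stack([np.ones(4), dummies])  # prepend an intercept column of ones
print(np.linalg.matrix_rank(X))  # 4, not 5 -- one column is redundant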
To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First, let's load the data.
Example Data¶
[2]:
import pandas as pd
url = "https://stats.idre.ucla.edu/stat/data/hsb2.csv"
hsb2 = pd.read_table(url, delimiter=",")
[3]:
hsb2.head(10)
[3]:
| | id | female | race | ses | schtyp | prog | read | write | math | science | socst |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 0 | 4 | 1 | 1 | 1 | 57 | 52 | 41 | 47 | 57 |
| 1 | 121 | 1 | 4 | 2 | 1 | 3 | 68 | 59 | 53 | 63 | 61 |
| 2 | 86 | 0 | 4 | 3 | 1 | 1 | 44 | 33 | 54 | 58 | 31 |
| 3 | 141 | 0 | 4 | 3 | 1 | 3 | 63 | 44 | 47 | 53 | 56 |
| 4 | 172 | 0 | 4 | 2 | 1 | 2 | 47 | 52 | 57 | 53 | 61 |
| 5 | 113 | 0 | 4 | 2 | 1 | 2 | 44 | 52 | 51 | 63 | 61 |
| 6 | 50 | 0 | 3 | 2 | 1 | 1 | 50 | 59 | 42 | 53 | 61 |
| 7 | 11 | 0 | 1 | 2 | 1 | 2 | 34 | 46 | 45 | 39 | 36 |
| 8 | 84 | 0 | 4 | 2 | 1 | 1 | 63 | 57 | 54 | 58 | 51 |
| 9 | 48 | 0 | 3 | 2 | 1 | 2 | 57 | 55 | 52 | 50 | 51 |
It will be instructive to look at the mean of the dependent variable, write, for each level of race (1 = Hispanic, 2 = Asian, 3 = African American, 4 = Caucasian).
[4]:
hsb2.groupby("race")["write"].mean()
[4]:
race
1 46.458333
2 58.000000
3 48.200000
4 54.055172
Name: write, dtype: float64
Treatment (Dummy) Coding¶
Dummy coding is likely the most well known coding scheme. It compares each level of the categorical variable to a base reference level. The base reference level is the value of the intercept. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be
[5]:
from patsy.contrasts import Treatment
levels = [1, 2, 3, 4]
contrast = Treatment(reference=0).code_without_intercept(levels)
print(contrast.matrix)
[[0. 0. 0.]
[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
Here we used reference=0, which implies that the first level, Hispanic, is the reference category against which the other level effects are measured. As mentioned above, the columns do not sum to zero and are thus not independent of the intercept.
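As a quick numeric check of that claim (a one-line sketch we add here, reusing the contrast object built above):
print(contrast.matrix.sum(axis=0))  # [1. 1. 1.] -- each column sums to one, not zero
To be explicit, let's look at how this would encode the race variable.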
[6]:
hsb2.race.head(10)
[6]:
0 4
1 4
2 4
3 4
4 4
5 4
6 3
7 1
8 4
9 3
Name: race, dtype: int64
[7]:
print(contrast.matrix[hsb2.race - 1, :][:20])
[[0. 0. 1.]
[0. 0. 1.]
[0. 0. 1.]
[0. 0. 1.]
[0. 0. 1.]
[0. 0. 1.]
[0. 1. 0.]
[0. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]
[0. 0. 1.]
[0. 0. 1.]
[0. 0. 1.]
[0. 0. 1.]
[0. 1. 0.]
[0. 0. 1.]
[0. 0. 1.]
[0. 0. 1.]
[0. 0. 1.]
[0. 0. 1.]]
[8]:
pd.get_dummies(hsb2.race.values, drop_first=False)
[8]:
| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | False | False | False | True |
| 1 | False | False | False | True |
| 2 | False | False | False | True |
| 3 | False | False | False | True |
| 4 | False | False | False | True |
| ... | ... | ... | ... | ... |
| 195 | False | True | False | False |
| 196 | False | False | False | True |
| 197 | False | False | False | True |
| 198 | False | False | False | True |
| 199 | False | False | False | True |
200 rows × 4 columns
This is a bit of a trick, as the race category conveniently maps to zero-based indices. If it did not, this conversion would happen under the hood, so it will not work in general, but it is nonetheless a useful exercise to fix ideas.
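A more general way to obtain the zero-based codes, without assuming the levels are already numbered 1 through K, is to go through pd.Categorical, which assigns codes in sorted level order (a small sketch; the codes variable is ours):
codes = pd.Categorical(hsb2.race).codes  # zero-based codes, 0..3 here
print(np.array_equal(contrast.matrix[codes], contrast.matrix[hsb2.race - 1]))  # True
The output of the regression using the Treatment contrast is illustrated below.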
[9]:
from statsmodels.formula.api import ols
mod = ols("write ~ C(race, Treatment)", data=hsb2)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.107
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 7.833
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.78e-05
Time: 15:50:53 Log-Likelihood: -721.77
No. Observations: 200 AIC: 1452.
Df Residuals: 196 BIC: 1465.
Df Model: 3
Covariance Type: nonrobust
===========================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept 46.4583 1.842 25.218 0.000 42.825 50.091
C(race, Treatment)[T.2] 11.5417 3.286 3.512 0.001 5.061 18.022
C(race, Treatment)[T.3] 1.7417 2.732 0.637 0.525 -3.647 7.131
C(race, Treatment)[T.4] 7.5968 1.989 3.820 0.000 3.675 11.519
==============================================================================
Omnibus: 10.487 Durbin-Watson: 1.779
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.031
Skew: -0.551 Prob(JB): 0.00402
Kurtosis: 2.670 Cond. No. 8.25
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We explicitly gave the contrast for race; however, since Treatment is the default, we could have omitted it.
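As a sanity check (a short sketch we add using the group means computed earlier), the intercept is the mean of the reference level, and each treatment coefficient is a difference of level means:
means = hsb2.groupby("race")["write"].mean()
print(means.loc[1])                 # 46.4583..., the Intercept (reference level)
print(means.loc[2] - means.loc[1])  # 11.5417..., C(race, Treatment)[T.2]
print(means.loc[4] - means.loc[1])  # 7.5968...,  C(race, Treatment)[T.4]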
Simple Coding¶
Like Treatment Coding, Simple Coding compares each level to a fixed reference level. However, with simple coding, the intercept is the grand mean of all the levels of the factor. Patsy does not include the simple contrast, but you can easily define your own contrasts. To do so, write a class that contains a code_with_intercept and a code_without_intercept method that return a patsy.contrast.ContrastMatrix instance
[10]:
from patsy.contrasts import ContrastMatrix
def _name_levels(prefix, levels):
    return ["[%s%s]" % (prefix, level) for level in levels]


class Simple(object):
    def _simple_contrast(self, levels):
        # Each column contrasts one level against the reference: -1/k everywhere
        # except (k-1)/k on the level being compared, so the columns sum to zero.
        nlevels = len(levels)
        contr = -1.0 / nlevels * np.ones((nlevels, nlevels - 1))
        contr[1:][np.diag_indices(nlevels - 1)] = (nlevels - 1.0) / nlevels
        return contr

    def code_with_intercept(self, levels):
        contrast = np.column_stack(
            (np.ones(len(levels)), self._simple_contrast(levels))
        )
        return ContrastMatrix(contrast, _name_levels("Simp.", levels))

    def code_without_intercept(self, levels):
        contrast = self._simple_contrast(levels)
        return ContrastMatrix(contrast, _name_levels("Simp.", levels[:-1]))
[11]:
hsb2.groupby("race")["write"].mean().mean()
[11]:
np.float64(51.67837643678162)
[12]:
contrast = Simple().code_without_intercept(levels)
print(contrast.matrix)
[[-0.25 -0.25 -0.25]
[ 0.75 -0.25 -0.25]
[-0.25 0.75 -0.25]
[-0.25 -0.25 0.75]]
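Unlike the treatment matrix above, these columns do sum to zero, which is why the intercept picks up the grand mean rather than a reference level (a quick check we add here):
print(contrast.matrix.sum(axis=0))  # [0. 0. 0.] -- a true contrast, independent of the intercept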
[13]:
mod = ols("write ~ C(race, Simple)", data=hsb2)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.107
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 7.833
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.78e-05
Time: 15:50:53 Log-Likelihood: -721.77
No. Observations: 200 AIC: 1452.
Df Residuals: 196 BIC: 1465.
Df Model: 3
Covariance Type: nonrobust
===========================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept 51.6784 0.982 52.619 0.000 49.741 53.615
C(race, Simple)[Simp.1] 11.5417 3.286 3.512 0.001 5.061 18.022
C(race, Simple)[Simp.2] 1.7417 2.732 0.637 0.525 -3.647 7.131
C(race, Simple)[Simp.3] 7.5968 1.989 3.820 0.000 3.675 11.519
==============================================================================
Omnibus: 10.487 Durbin-Watson: 1.779
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.031
Skew: -0.551 Prob(JB): 0.00402
Kurtosis: 2.670 Cond. No. 7.03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Sum (Deviation) Coding¶
Sum coding compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others.
[14]:
from patsy.contrasts import Sum
contrast = Sum().code_without_intercept(levels)
print(contrast.matrix)
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]
[-1. -1. -1.]]
[15]:
mod = ols("write ~ C(race, Sum)", data=hsb2)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.107
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 7.833
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.78e-05
Time: 15:50:53 Log-Likelihood: -721.77
No. Observations: 200 AIC: 1452.
Df Residuals: 196 BIC: 1465.
Df Model: 3
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 51.6784 0.982 52.619 0.000 49.741 53.615
C(race, Sum)[S.1] -5.2200 1.631 -3.200 0.002 -8.437 -2.003
C(race, Sum)[S.2] 6.3216 2.160 2.926 0.004 2.061 10.582
C(race, Sum)[S.3] -3.4784 1.732 -2.008 0.046 -6.895 -0.062
==============================================================================
Omnibus: 10.487 Durbin-Watson: 1.779
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.031
Skew: -0.551 Prob(JB): 0.00402
Kurtosis: 2.670 Cond. No. 6.72
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This corresponds to a parameterization that forces all the coefficients to sum to zero. Notice that the intercept here is the grand mean, where the grand mean is the mean of the means of the dependent variable by each level.
[16]:
hsb2.groupby("race")["write"].mean().mean()
[16]:
np.float64(51.67837643678162)
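Equivalently (a small check we add here), each Sum coefficient is the corresponding level mean minus that grand mean:
means = hsb2.groupby("race")["write"].mean()
print(means.loc[1] - means.mean())  # -5.2200..., C(race, Sum)[S.1]
print(means.loc[2] - means.mean())  #  6.3216..., C(race, Sum)[S.2]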
Backward Difference Coding¶
In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
[17]:
from patsy.contrasts import Diff
contrast = Diff().code_without_intercept(levels)
print(contrast.matrix)
[[-0.75 -0.5 -0.25]
[ 0.25 -0.5 -0.25]
[ 0.25 0.5 -0.25]
[ 0.25 0.5 0.75]]
[18]:
mod = ols("write ~ C(race, Diff)", data=hsb2)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.107
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 7.833
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.78e-05
Time: 15:50:53 Log-Likelihood: -721.77
No. Observations: 200 AIC: 1452.
Df Residuals: 196 BIC: 1465.
Df Model: 3
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 51.6784 0.982 52.619 0.000 49.741 53.615
C(race, Diff)[D.1] 11.5417 3.286 3.512 0.001 5.061 18.022
C(race, Diff)[D.2] -9.8000 3.388 -2.893 0.004 -16.481 -3.119
C(race, Diff)[D.3] 5.8552 2.153 2.720 0.007 1.610 10.101
==============================================================================
Omnibus: 10.487 Durbin-Watson: 1.779
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.031
Skew: -0.551 Prob(JB): 0.00402
Kurtosis: 2.670 Cond. No. 8.30
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
For example, here the coefficient on level 1 is the mean of write at level 2 compared with the mean of write at level 1. That is,
[19]:
means = hsb2.groupby("race")["write"].mean()
means.loc[2] - means.loc[1]  # equals res.params["C(race, Diff)[D.1]"]
[19]:
np.float64(11.541666666666664)
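The remaining coefficients follow the same pattern; as a quick check we add here:
means = hsb2.groupby("race")["write"].mean()
print(means.loc[3] - means.loc[2])  # -9.8000..., C(race, Diff)[D.2]
print(means.loc[4] - means.loc[3])  #  5.8552..., C(race, Diff)[D.3]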
Helmert Coding¶
Our version of Helmert coding is sometimes referred to as Reverse Helmert Coding. The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. Hence, the name "reverse" is sometimes applied to differentiate it from forward Helmert coding. This comparison does not make much sense for a nominal variable such as race, but we would use the Helmert contrast like so
[20]:
from patsy.contrasts import Helmert
contrast = Helmert().code_without_intercept(levels)
print(contrast.matrix)
[[-1. -1. -1.]
[ 1. -1. -1.]
[ 0. 2. -1.]
[ 0. 0. 3.]]
[21]:
mod = ols("write ~ C(race, Helmert)", data=hsb2)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.107
Model: OLS Adj. R-squared: 0.093
Method: Least Squares F-statistic: 7.833
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.78e-05
Time: 15:50:53 Log-Likelihood: -721.77
No. Observations: 200 AIC: 1452.
Df Residuals: 196 BIC: 1465.
Df Model: 3
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 51.6784 0.982 52.619 0.000 49.741 53.615
C(race, Helmert)[H.2] 5.7708 1.643 3.512 0.001 2.530 9.011
C(race, Helmert)[H.3] -1.3431 0.867 -1.548 0.123 -3.054 0.368
C(race, Helmert)[H.4] 0.7923 0.372 2.130 0.034 0.059 1.526
==============================================================================
Omnibus: 10.487 Durbin-Watson: 1.779
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.031
Skew: -0.551 Prob(JB): 0.00402
Kurtosis: 2.670 Cond. No. 7.26
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
To illustrate, the comparison on level 4 is the mean of the dependent variable for the previous three levels taken against the mean of the dependent variable at level 4
[22]:
grouped = hsb2.groupby("race")
grouped.mean()["write"].loc[4] - grouped.mean()["write"].loc[:3].mean()
[22]:
np.float64(3.169061302681982)
[23]:
grouped.mean()["write"]
[23]:
race
1 46.458333
2 58.000000
3 48.200000
4 54.055172
Name: write, dtype: float64
As you can see, these are only equal up to a constant. Other versions of the Helmert contrast give the actual difference in means. Regardless, the hypothesis tests are the same.
[24]:
k = 4
c4 = 1.0 / k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[: k - 1].mean())
k = 3
c3 = 1.0 / k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[: k - 1].mean())
print(c4, c3)
0.7922653256704955 -1.3430555555555561
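To recover the actual mean differences from this parameterization, multiply the coefficient for level k by k (a quick check we add using the fitted results):
print(res.params["C(race, Helmert)[H.3]"] * 3)  # -4.0292..., level 3 mean minus the mean of levels 1-2
print(res.params["C(race, Helmert)[H.4]"] * 4)  #  3.1691..., matches the difference computed above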
Orthogonal Polynomial Coding¶
The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of coding is used only for ordered categorical variables with equal spacing. In general, the polynomial contrast produces polynomials of order k-1. Since race is not an ordered factor variable, let's use read as an example. First we need to create an ordered categorical from read.
[25]:
hsb2["readcat"] = np.asarray(pd.cut(hsb2.read, bins=4))
hsb2["readcat"] = hsb2["readcat"].astype(object)
hsb2.groupby("readcat").mean()["write"]
[25]:
readcat
(27.952, 40.0] 42.772727
(40.0, 52.0] 49.978495
(52.0, 64.0] 56.563636
(64.0, 76.0] 61.833333
Name: write, dtype: float64
[26]:
from patsy.contrasts import Poly
levels = hsb2.readcat.unique()
contrast = Poly().code_without_intercept(levels)
print(contrast.matrix)
[[-0.67082039 0.5 -0.2236068 ]
[-0.2236068 -0.5 0.67082039]
[ 0.2236068 -0.5 -0.67082039]
[ 0.67082039 0.5 0.2236068 ]]
[27]:
mod = ols("write ~ C(readcat, Poly)", data=hsb2)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: write R-squared: 0.346
Model: OLS Adj. R-squared: 0.336
Method: Least Squares F-statistic: 34.51
Date: Thu, 03 Oct 2024 Prob (F-statistic): 5.95e-18
Time: 15:50:53 Log-Likelihood: -690.69
No. Observations: 200 AIC: 1389.
Df Residuals: 196 BIC: 1403.
Df Model: 3
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
Intercept 52.7870 0.634 83.268 0.000 51.537 54.037
C(readcat, Poly).Linear 14.2587 1.484 9.607 0.000 11.332 17.186
C(readcat, Poly).Quadratic -0.9680 1.268 -0.764 0.446 -3.468 1.532
C(readcat, Poly).Cubic -0.1554 1.006 -0.154 0.877 -2.140 1.829
==============================================================================
Omnibus: 4.467 Durbin-Watson: 1.768
Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.289
Skew: -0.307 Prob(JB): 0.117
Kurtosis: 2.628 Cond. No. 3.01
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
As you can see, readcat has a significant linear effect on the dependent variable write, but the quadratic and cubic effects are not significant.
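The polynomial contrast columns are orthonormal and orthogonal to the intercept, which is what keeps the three trend estimates separable (a quick check we add here):
print(np.round(contrast.matrix.T @ contrast.matrix, 8))  # identity matrix: orthonormal columns
print(np.round(contrast.matrix.sum(axis=0), 8))          # zeros: orthogonal to the intercept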