Contrasts Overview

[1]:
import numpy as np
import statsmodels.api as sm

This document is based heavily on this excellent resource from UCLA: http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm

A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, different coding methods that amount to different sets of linear hypotheses.

In fact, dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of contrasts for a categorical variable with k levels is a set of k-1 functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. Dummy coding is not wrong per se: it captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients, as in ANOVA. Linear regression models do not assume independence of the coefficients, so dummy coding is often the only coding taught in this context.
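The collinearity point above can be checked directly with a small numpy sketch. The factor values below are hypothetical, and the sum-to-zero matrix is the deviation contrast that appears later in this document:

```python
import numpy as np

# A hypothetical 4-level factor observed on five cases, coded with a full
# set of K dummy columns
levels = np.array([0, 2, 1, 3, 2])
dummies = np.eye(4)[levels]
# Each row sums to one, so the K dummies reproduce the intercept column
print(dummies.sum(axis=1))          # [1. 1. 1. 1. 1.]

# A sum-to-zero contrast matrix for 4 levels: each column sums to zero
# across the levels, so the coded columns are independent of the intercept
contrast = np.vstack([np.eye(3), -np.ones((1, 3))])
print(contrast.sum(axis=0))         # [0. 0. 0.]
```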

To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First, let's load the data.

Example Data

[2]:
import pandas as pd

url = "https://stats.idre.ucla.edu/stat/data/hsb2.csv"
hsb2 = pd.read_table(url, delimiter=",")
[3]:
hsb2.head(10)
[3]:
id female race ses schtyp prog read write math science socst
0 70 0 4 1 1 1 57 52 41 47 57
1 121 1 4 2 1 3 68 59 53 63 61
2 86 0 4 3 1 1 44 33 54 58 31
3 141 0 4 3 1 3 63 44 47 53 56
4 172 0 4 2 1 2 47 52 57 53 61
5 113 0 4 2 1 2 44 52 51 63 61
6 50 0 3 2 1 1 50 59 42 53 61
7 11 0 1 2 1 2 34 46 45 39 36
8 84 0 4 2 1 1 63 57 54 58 51
9 48 0 3 2 1 2 57 55 52 50 51

It will be instructive to look at the mean of the dependent variable, write, for each level of race (1 = Hispanic, 2 = Asian, 3 = African American, 4 = Caucasian).

[4]:
hsb2.groupby("race")["write"].mean()
[4]:
race
1    46.458333
2    58.000000
3    48.200000
4    54.055172
Name: write, dtype: float64

Treatment (Dummy) Coding

Dummy coding is likely the most well-known coding scheme. It compares each level of the categorical variable to a base reference level. The base reference level is the value of the intercept. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be

[5]:
from patsy.contrasts import Treatment

levels = [1, 2, 3, 4]
contrast = Treatment(reference=0).code_without_intercept(levels)
print(contrast.matrix)
[[0. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

Here we used reference=0, which implies that the first level, Hispanic, is the reference category against which the other level effects are measured. As mentioned above, the columns do not sum to zero and are thus not independent of the intercept. To be explicit, let's look at how this would encode the race variable.

[6]:
hsb2.race.head(10)
[6]:
0    4
1    4
2    4
3    4
4    4
5    4
6    3
7    1
8    4
9    3
Name: race, dtype: int64
[7]:
print(contrast.matrix[hsb2.race - 1, :][:20])
[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]]
[8]:
pd.get_dummies(hsb2.race.values, drop_first=False)
[8]:
1 2 3 4
0 False False False True
1 False False False True
2 False False False True
3 False False False True
4 False False False True
... ... ... ... ...
195 False True False False
196 False False False True
197 False False False True
198 False False False True
199 False False False True

200 rows × 4 columns

This is a bit of a trick, as the race categories conveniently map to zero-based indices. If they did not, this conversion would happen under the hood, so this will not work in general, but it is nonetheless a useful exercise to fix ideas. The below illustrates the output using the three contrasts above

[9]:
from statsmodels.formula.api import ols

mod = ols("write ~ C(race, Treatment)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        15:50:53   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3
Covariance Type:            nonrobust
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  46.4583      1.842     25.218      0.000      42.825      50.091
C(race, Treatment)[T.2]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Treatment)[T.3]     1.7417      2.732      0.637      0.525      -3.647       7.131
C(race, Treatment)[T.4]     7.5968      1.989      3.820      0.000       3.675      11.519
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         8.25
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We explicitly gave the contrast for race; however, since Treatment is the default, we could have omitted it.
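As a quick sanity check (a numpy sketch using the group means printed in cell [4], not a re-fit of the model), the Treatment coefficients are simply each level mean minus the reference-level mean:

```python
import numpy as np

# Level means of write for race levels 1-4, from the groupby above
means = np.array([46.458333, 58.000000, 48.200000, 54.055172])

# Treatment coefficients: each non-reference level mean minus the level 1 mean
print(means[1:] - means[0])   # ~[11.5417  1.7417  7.5968], as in the summary
```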

Simple Coding

Like Treatment coding, Simple coding compares each level to a fixed reference level. However, with simple coding, the intercept is the grand mean of all the levels of the factor. Patsy does not include the Simple contrast, but you can easily define your own contrasts. To do so, write a class that contains code_with_intercept and code_without_intercept methods that return a patsy.contrast.ContrastMatrix instance

[10]:
from patsy.contrasts import ContrastMatrix


def _name_levels(prefix, levels):
    return ["[%s%s]" % (prefix, level) for level in levels]


class Simple(object):
    def _simple_contrast(self, levels):
        nlevels = len(levels)
        contr = -1.0 / nlevels * np.ones((nlevels, nlevels - 1))
        contr[1:][np.diag_indices(nlevels - 1)] = (nlevels - 1.0) / nlevels
        return contr

    def code_with_intercept(self, levels):
        contrast = np.column_stack(
            (np.ones(len(levels)), self._simple_contrast(levels))
        )
        return ContrastMatrix(contrast, _name_levels("Simp.", levels))

    def code_without_intercept(self, levels):
        contrast = self._simple_contrast(levels)
        return ContrastMatrix(contrast, _name_levels("Simp.", levels[:-1]))
[11]:
hsb2.groupby("race")["write"].mean().mean()
[11]:
np.float64(51.67837643678162)
[12]:
contrast = Simple().code_without_intercept(levels)
print(contrast.matrix)
[[-0.25 -0.25 -0.25]
 [ 0.75 -0.25 -0.25]
 [-0.25  0.75 -0.25]
 [-0.25 -0.25  0.75]]
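To see that the intercept under simple coding is indeed the grand mean, we can solve the small exact system that the level means satisfy. This is a numpy sketch using the group means from cell [4] and the contrast matrix just printed:

```python
import numpy as np

# Level means of write for race levels 1-4, from cell [4]
means = np.array([46.458333, 58.000000, 48.200000, 54.055172])

# Simple contrast matrix from above, with an intercept column prepended
S = np.array([[-0.25, -0.25, -0.25],
              [ 0.75, -0.25, -0.25],
              [-0.25,  0.75, -0.25],
              [-0.25, -0.25,  0.75]])
X = np.column_stack([np.ones(4), S])

# Solve the exact 4x4 system X @ b = means
b = np.linalg.solve(X, means)
print(b[0])    # ~51.678, the grand mean (intercept)
print(b[1:])   # ~[11.5417  1.7417  7.5968], the same slopes as Treatment coding
```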
[13]:
mod = ols("write ~ C(race, Simple)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        15:50:53   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3
Covariance Type:            nonrobust
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Simple)[Simp.1]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Simple)[Simp.2]     1.7417      2.732      0.637      0.525      -3.647       7.131
C(race, Simple)[Simp.3]     7.5968      1.989      3.820      0.000       3.675      11.519
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         7.03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Sum (Deviation) Coding

Sum coding compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others.

[14]:
from patsy.contrasts import Sum

contrast = Sum().code_without_intercept(levels)
print(contrast.matrix)
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [-1. -1. -1.]]
[15]:
mod = ols("write ~ C(race, Sum)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        15:50:53   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Sum)[S.1]    -5.2200      1.631     -3.200      0.002      -8.437      -2.003
C(race, Sum)[S.2]     6.3216      2.160      2.926      0.004       2.061      10.582
C(race, Sum)[S.3]    -3.4784      1.732     -2.008      0.046      -6.895      -0.062
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         6.72
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This corresponds to a parameterization that forces all the coefficients to sum to zero. Notice that the intercept here is the grand mean, where the grand mean is the mean of the means of the dependent variable by each level.
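This can be verified by hand from the group means in cell [4]: each Sum coefficient is a level mean minus the grand mean (a numpy sketch, not a re-fit of the model):

```python
import numpy as np

# Level means of write for race levels 1-4, from cell [4]
means = np.array([46.458333, 58.000000, 48.200000, 54.055172])
grand = means.mean()

# Deviations of each level mean from the grand mean; the first three are the
# fitted Sum coefficients, and the omitted level 4 deviation is minus their sum
deviations = means - grand
print(deviations[:3])    # ~[-5.2200  6.3216 -3.4784], as in the summary
print(deviations.sum())  # ~0 by construction
```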

[16]:
hsb2.groupby("race")["write"].mean().mean()
[16]:
np.float64(51.67837643678162)

Backward Difference Coding

In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.

[17]:
from patsy.contrasts import Diff

contrast = Diff().code_without_intercept(levels)
print(contrast.matrix)
[[-0.75 -0.5  -0.25]
 [ 0.25 -0.5  -0.25]
 [ 0.25  0.5  -0.25]
 [ 0.25  0.5   0.75]]
[18]:
mod = ols("write ~ C(race, Diff)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        15:50:53   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept             51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Diff)[D.1]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Diff)[D.2]    -9.8000      3.388     -2.893      0.004     -16.481      -3.119
C(race, Diff)[D.3]     5.8552      2.153      2.720      0.007       1.610      10.101
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         8.30
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

For example, here the coefficient on level 1 is the mean of write at level 2 compared with the mean at level 1. That is,

[19]:
res.params["C(race, Diff)[D.1]"]
hsb2.groupby("race").mean()["write"][2] - hsb2.groupby("race").mean()["write"][1]
[19]:
np.float64(11.541666666666664)
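All three backward-difference coefficients can be recovered at once as successive differences of the level means (a numpy sketch using the group means from cell [4]):

```python
import numpy as np

# Level means of write for race levels 1-4, from cell [4]
means = np.array([46.458333, 58.000000, 48.200000, 54.055172])

# np.diff gives mean(level k+1) - mean(level k) for k = 1, 2, 3
print(np.diff(means))   # ~[11.5417 -9.8000  5.8552], as in the summary
```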

Helmert Coding

Our version of Helmert coding is sometimes referred to as reverse Helmert coding. The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. Hence the name "reverse" is sometimes applied to differentiate it from forward Helmert coding. This comparison does not make much sense for a nominal variable such as race, but we will use the Helmert contrast like so

[20]:
from patsy.contrasts import Helmert

contrast = Helmert().code_without_intercept(levels)
print(contrast.matrix)
[[-1. -1. -1.]
 [ 1. -1. -1.]
 [ 0.  2. -1.]
 [ 0.  0.  3.]]
[21]:
mod = ols("write ~ C(race, Helmert)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        15:50:53   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3
Covariance Type:            nonrobust
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Helmert)[H.2]     5.7708      1.643      3.512      0.001       2.530       9.011
C(race, Helmert)[H.3]    -1.3431      0.867     -1.548      0.123      -3.054       0.368
C(race, Helmert)[H.4]     0.7923      0.372      2.130      0.034       0.059       1.526
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         7.26
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

To illustrate, the comparison on level 4 is the mean of the dependent variable at the previous three levels taken from the mean at level 4

[22]:
grouped = hsb2.groupby("race")
grouped.mean()["write"].loc[4] - grouped.mean()["write"].loc[:3].mean()
[22]:
np.float64(3.169061302681982)
[23]:
grouped.mean()["write"]
[23]:
race
1    46.458333
2    58.000000
3    48.200000
4    54.055172
Name: write, dtype: float64

As you can see, these are only equal up to a constant. Other versions of the Helmert contrast give the actual difference in means. Regardless, the hypothesis tests are the same.

[24]:
k = 4
c4 = 1.0 / k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[: k - 1].mean())
k = 3
c3 = 1.0 / k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[: k - 1].mean())
print(c4, c3)
0.7922653256704955 -1.3430555555555561
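The same 1/k scaling applies to every Helmert coefficient, including H.2, which was not computed above. A short numpy sketch over the group means from cell [4] recovers all three at once:

```python
import numpy as np

# Level means of write for race levels 1-4, from cell [4]
means = np.array([46.458333, 58.000000, 48.200000, 54.055172])

# H.k compares the level k mean to the mean of all previous levels, scaled by 1/k
helmert = [(means[k - 1] - means[: k - 1].mean()) / k for k in range(2, 5)]
print(helmert)   # ~[5.7708, -1.3431, 0.7923], as in the summary
```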

Orthogonal Polynomial Coding

The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of coding system should be used only with an ordinal variable in which the levels are equally spaced. In general, the polynomial contrast produces polynomials of order k-1. Since race is not an ordered factor variable, let's use read as an example. First we need to create an ordered categorical from read.

[25]:
hsb2["readcat"] = np.asarray(pd.cut(hsb2.read, bins=4))
hsb2["readcat"] = hsb2["readcat"].astype(object)
hsb2.groupby("readcat").mean()["write"]
[25]:
readcat
(27.952, 40.0]    42.772727
(40.0, 52.0]      49.978495
(52.0, 64.0]      56.563636
(64.0, 76.0]      61.833333
Name: write, dtype: float64
[26]:
from patsy.contrasts import Poly

levels = hsb2.readcat.unique()
contrast = Poly().code_without_intercept(levels)
print(contrast.matrix)
[[-0.67082039  0.5        -0.2236068 ]
 [-0.2236068  -0.5         0.67082039]
 [ 0.2236068  -0.5        -0.67082039]
 [ 0.67082039  0.5         0.2236068 ]]
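For intuition, an equivalent matrix can be constructed by orthonormalizing the columns of a Vandermonde matrix of the level scores via a QR decomposition. This is only a sketch of the idea, not Patsy's actual implementation:

```python
import numpy as np

k = 4
x = np.arange(1, k + 1)                 # equally spaced level scores
V = np.vander(x, k, increasing=True)    # columns: 1, x, x^2, x^3
Q, _ = np.linalg.qr(V)                  # orthonormalize the columns
poly = Q[:, 1:]                         # drop the constant column
poly = poly * np.sign(poly[-1])         # fix column signs (last row positive)
print(poly)                             # matches the Poly matrix above
```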
[27]:
mod = ols("write ~ C(readcat, Poly)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.346
Model:                            OLS   Adj. R-squared:                  0.336
Method:                 Least Squares   F-statistic:                     34.51
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.95e-18
Time:                        15:50:53   Log-Likelihood:                -690.69
No. Observations:                 200   AIC:                             1389.
Df Residuals:                     196   BIC:                             1403.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                     52.7870      0.634     83.268      0.000      51.537      54.037
C(readcat, Poly).Linear       14.2587      1.484      9.607      0.000      11.332      17.186
C(readcat, Poly).Quadratic    -0.9680      1.268     -0.764      0.446      -3.468       1.532
C(readcat, Poly).Cubic        -0.1554      1.006     -0.154      0.877      -2.140       1.829
==============================================================================
Omnibus:                        4.467   Durbin-Watson:                   1.768
Prob(Omnibus):                  0.107   Jarque-Bera (JB):                4.289
Skew:                          -0.307   Prob(JB):                        0.117
Kurtosis:                       2.628   Cond. No.                         3.01
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

As you can see, readcat has a significant linear effect on the dependent variable write, but the quadratic and cubic effects are insignificant.


Last update: Oct 03, 2024