加權廣義線性模型

[1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

加權 GLM:泊松反應數據

載入數據

在這個範例中,我們將使用婚外情數據集,使用一些外生變數來預測婚外情發生率。

將產生權重來顯示 freq_weights 等同於重複的數據記錄。另一方面,var_weights 等同於聚合數據。

[2]:
print(sm.datasets.fair.NOTE)
::

    Number of observations: 6366
    Number of variables: 9
    Variable name definitions:

        rate_marriage   : How rate marriage, 1 = very poor, 2 = poor, 3 = fair,
                        4 = good, 5 = very good
        age             : Age
        yrs_married     : No. years married. Interval approximations. See
                        original paper for detailed explanation.
        children        : No. children
        religious       : How relgious, 1 = not, 2 = mildly, 3 = fairly,
                        4 = strongly
        educ            : Level of education, 9 = grade school, 12 = high
                        school, 14 = some college, 16 = college graduate,
                        17 = some graduate school, 20 = advanced degree
        occupation      : 1 = student, 2 = farming, agriculture; semi-skilled,
                        or unskilled worker; 3 = white-colloar; 4 = teacher
                        counselor social worker, nurse; artist, writers;
                        technician, skilled worker, 5 = managerial,
                        administrative, business, 6 = professional with
                        advanced degree
        occupation_husb : Husband's occupation. Same as occupation.
        affairs         : measure of time spent in extramarital affairs

    See the original paper for more details.

將數據載入 pandas dataframe。

[3]:
data = sm.datasets.fair.load_pandas().data

相依(內生)變數為 affairs

[4]:
data.describe()
[4]:
rate_marriage age yrs_married children religious educ occupation occupation_husb affairs
count 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000
mean 4.109645 29.082862 9.009425 1.396874 2.426170 14.209865 3.424128 3.850141 0.705374
std 0.961430 6.847882 7.280120 1.433471 0.878369 2.178003 0.942399 1.346435 2.203374
min 1.000000 17.500000 0.500000 0.000000 1.000000 9.000000 1.000000 1.000000 0.000000
25% 4.000000 22.000000 2.500000 0.000000 2.000000 12.000000 3.000000 3.000000 0.000000
50% 4.000000 27.000000 6.000000 1.000000 2.000000 14.000000 3.000000 4.000000 0.000000
75% 5.000000 32.000000 16.500000 2.000000 3.000000 16.000000 4.000000 5.000000 0.484848
max 5.000000 42.000000 23.000000 5.500000 4.000000 20.000000 6.000000 6.000000 57.599991
[5]:
data[:3]
[5]:
rate_marriage age yrs_married children religious educ occupation occupation_husb affairs
0 3.0 32.0 9.0 3.0 3.0 17.0 2.0 5.0 0.111111
1 3.0 27.0 13.0 3.0 1.0 14.0 3.0 4.0 3.230769
2 4.0 22.0 2.5 0.0 1.0 16.0 3.0 5.0 1.400000

在接下來的步驟中,我們主要使用泊松。雖然使用小數的 affairs 可以運作,但我們將它們轉換為整數以獲得計數分佈。

[6]:
data["affairs"] = np.ceil(data["affairs"])
data[:3]
[6]:
rate_marriage age yrs_married children religious educ occupation occupation_husb affairs
0 3.0 32.0 9.0 3.0 3.0 17.0 2.0 5.0 1.0
1 3.0 27.0 13.0 3.0 1.0 14.0 3.0 4.0 4.0
2 4.0 22.0 2.5 0.0 1.0 16.0 3.0 5.0 2.0
[7]:
(data["affairs"] == 0).mean()
[7]:
np.float64(0.6775054979579014)
[8]:
np.bincount(data["affairs"].astype(int))
[8]:
array([4313,  934,  488,  180,  130,  172,    7,   21,   67,    2,    0,
          0,   17,    0,    0,    0,    3,   12,    8,    0,    0,    0,
          0,    0,    2,    2,    2,    3,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    1,    1,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    1])

壓縮和聚合觀測值

我們的原始數據集中有 6366 個觀測值。當我們只考慮一些選定的變數時,我們擁有的唯一觀測值會更少。在接下來的步驟中,我們以兩種方式組合觀測值,首先我們組合所有變數值都相同的觀測值,其次我們組合具有相同解釋變數的觀測值。

具有唯一觀測值的數據集

我們使用 pandas 的 groupby 來組合相同的觀測值,並建立一個新的變數 freq 來計算有多少個觀測值具有對應行的值。

[9]:
data2 = data.copy()
data2["const"] = 1
dc = (
    data2["affairs rate_marriage age yrs_married const".split()]
    .groupby("affairs rate_marriage age yrs_married".split())
    .count()
)
dc.reset_index(inplace=True)
dc.rename(columns={"const": "freq"}, inplace=True)
print(dc.shape)
dc.head()
(476, 5)
[9]:
affairs rate_marriage age yrs_married freq
0 0.0 1.0 17.5 0.5 1
1 0.0 1.0 22.0 2.5 3
2 0.0 1.0 27.0 2.5 1
3 0.0 1.0 27.0 6.0 5
4 0.0 1.0 27.0 9.0 1

具有唯一解釋變數的數據集 (exog)

對於下一個數據集,我們組合具有相同解釋變數值的觀測值。但是,由於反應變數在組合的觀測值之間可能不同,因此我們計算所有組合觀測值的反應變數的平均值和總和。

我們再次使用 pandas groupby 來組合觀測值並建立新的變數。我們也將 MultiIndex 展平為簡單索引。

[10]:
gr = data["affairs rate_marriage age yrs_married".split()].groupby(
    "rate_marriage age yrs_married".split()
)
df_a = gr.agg(["mean", "sum", "count"])


def merge_tuple(tpl):
    if isinstance(tpl, tuple) and len(tpl) > 1:
        return "_".join(map(str, tpl))
    else:
        return tpl


df_a.columns = df_a.columns.map(merge_tuple)
df_a.reset_index(inplace=True)
print(df_a.shape)
df_a.head()
(130, 6)
[10]:
rate_marriage age yrs_married affairs_mean affairs_sum affairs_count
0 1.0 17.5 0.5 0.000000 0.0 1
1 1.0 22.0 2.5 3.900000 39.0 10
2 1.0 27.0 2.5 3.400000 17.0 5
3 1.0 27.0 6.0 0.900000 9.0 10
4 1.0 27.0 9.0 1.333333 4.0 3

在組合觀測值後,我們有一個具有 467 個唯一觀測值的數據框 dc 和一個具有 130 個解釋變數唯一值的觀測值的數據框 df_a

[11]:
print("number of rows: \noriginal, with unique observations, with unique exog")
data.shape[0], dc.shape[0], df_a.shape[0]
number of rows:
original, with unique observations, with unique exog
[11]:
(6366, 476, 130)

分析

在接下來的步驟中,我們將原始數據的 GLM 泊松結果與組合觀測值的模型進行比較,其中多重性或聚合由權重或 exposure 給出。

原始數據

[12]:
glm = smf.glm(
    "affairs ~ rate_marriage + age + yrs_married",
    data=data,
    family=sm.families.Poisson(),
)
res_o = glm.fit()
print(res_o.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                 6366
Model:                            GLM   Df Residuals:                     6362
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Thu, 03 Oct 2024   Deviance:                       15375.
Time:                        15:45:07   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.2420
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[13]:
res_o.pearson_chi2 / res_o.df_resid
[13]:
np.float64(5.078702313363214)

壓縮數據(具有頻率的唯一觀測值)

結合相同的觀測值並使用頻率權重來考慮觀測值的多重性會產生完全相同的結果。當我們想要有關觀測值而非有關所有相同觀測值的總和的資訊時,某些結果屬性會有所不同。例如,殘差不考慮 freq_weights

[14]:
glm = smf.glm(
    "affairs ~ rate_marriage + age + yrs_married",
    data=dc,
    family=sm.families.Poisson(),
    freq_weights=np.asarray(dc["freq"]),
)
res_f = glm.fit()
print(res_f.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                     6362
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Thu, 03 Oct 2024   Deviance:                       15375.
Time:                        15:45:07   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9754
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[15]:
res_f.pearson_chi2 / res_f.df_resid
[15]:
np.float64(5.078702313363196)

使用 var_weights 而非 freq_weights 壓縮

接下來,我們將 var_weightsfreq_weights 進行比較。當內生變數反映平均值而不是相同的觀測值時,將 var_weights 納入是一種常見的做法。我不認為有理論上的理由說明為什麼它會產生相同的結果(一般而言)。

這會產生相同的結果,但 df_residfreq_weights 範例不同,因為 var_weights 不會改變有效觀測值的數量。

[16]:
glm = smf.glm(
    "affairs ~ rate_marriage + age + yrs_married",
    data=dc,
    family=sm.families.Poisson(),
    var_weights=np.asarray(dc["freq"]),
)
res_fv = glm.fit()
print(res_fv.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                      472
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Thu, 03 Oct 2024   Deviance:                       15375.
Time:                        15:45:07   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9754
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================

從結果計算得出的離散度不正確,因為 df_resid 錯誤。如果我們使用原始的 df_resid,它是正確的。

[17]:
res_fv.pearson_chi2 / res_fv.df_resid, res_f.pearson_chi2 / res_f.df_resid
[17]:
(np.float64(68.45488160512002), np.float64(5.078702313363196))

聚合或平均數據(解釋變數的唯一值)

對於這些情況,我們結合具有相同解釋變數值的觀測值。相應的反應變數是總和或平均值。

使用 exposure

如果我們的相依變數是所有組合觀測值的反應總和,那麼在泊松假設下,分佈保持不變,但我們的 exposure 會根據由一個聚合觀測值表示的個體數量而變化。

參數估計和參數共變異數與原始數據相同,但對數概似、離差和皮爾森卡方不同

[18]:
glm = smf.glm(
    "affairs_sum ~ rate_marriage + age + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    exposure=np.asarray(df_a["affairs_count"]),
)
res_e = glm.fit()
print(res_e.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            affairs_sum   No. Observations:                  130
Model:                            GLM   Df Residuals:                      126
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -740.75
Date:                Thu, 03 Oct 2024   Deviance:                       967.46
Time:                        15:45:07   Pearson chi2:                     926.
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[19]:
res_e.pearson_chi2 / res_e.df_resid
[19]:
np.float64(7.35078910917951)

使用 var_weights

我們也可以使用相依變數所有組合值的平均值。在這種情況下,變異數將與一個組合觀測值反映的總 exposure 的倒數相關。

[20]:
glm = smf.glm(
    "affairs_mean ~ rate_marriage + age + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    var_weights=np.asarray(df_a["affairs_count"]),
)
res_a = glm.fit()
print(res_a.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:           affairs_mean   No. Observations:                  130
Model:                            GLM   Df Residuals:                      126
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -5954.2
Date:                Thu, 03 Oct 2024   Deviance:                       967.46
Time:                        15:45:07   Pearson chi2:                     926.
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================

比較

我們在上面的摘要列印中看到,paramscov_params 與相關的 Wald 推論在不同版本之間一致。我們在下面總結這一點,比較不同版本之間的個別結果屬性。

參數估計值 params、參數的標準誤差 bse 和參數的 pvalues(用於檢定參數是否為零)都一致。但是,概似和擬合優度統計資料 llfdeviancepearson_chi2 僅部分一致。具體而言,聚合版本與使用原始數據的結果不一致。

警告llfdeviancepearson_chi2 的行為在未來版本中可能仍會變更。

對於解釋變數的唯一值,反應變數的總和與平均值都具有適當的概似解釋。但是,此解釋並未反映在這三個統計資料中。計算上,這可能是由於使用聚合數據時缺少調整所致。但是,理論上我們可以思考在這些情況下,特別是當概似分析不適用時,對於 var_weights,結果應解釋為準概似估計。 var_weights 的定義存在歧義,因為它們可用於具有正確指定概似的平均值,以及用於準概似情況下的變異數調整。我們目前不嘗試匹配概似規格。但是,在下一節中,我們將顯示,當我們假設基礎模型已正確指定時,對於所有聚合版本,概似比類型檢定仍會產生相同的結果。

[21]:
results_all = [res_o, res_f, res_e, res_a]
names = "res_o res_f res_e res_a".split()
[22]:
pd.concat([r.params for r in results_all], axis=1, keys=names)
[22]:
res_o res_f res_e res_a
截距 2.715533 2.715533 2.715533 2.715533
rate_marriage -0.495180 -0.495180 -0.495180 -0.495180
age -0.029914 -0.029914 -0.029914 -0.029914
yrs_married -0.010763 -0.010763 -0.010763 -0.010763
[23]:
pd.concat([r.bse for r in results_all], axis=1, keys=names)
[23]:
res_o res_f res_e res_a
截距 0.107360 0.107360 0.107360 0.107360
rate_marriage 0.011874 0.011874 0.011874 0.011874
age 0.004471 0.004471 0.004471 0.004471
yrs_married 0.004294 0.004294 0.004294 0.004294
[24]:
pd.concat([r.pvalues for r in results_all], axis=1, keys=names)
[24]:
res_o res_f res_e res_a
截距 3.756282e-141 3.756280e-141 3.756282e-141 3.756282e-141
rate_marriage 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
age 2.221918e-11 2.221918e-11 2.221918e-11 2.221918e-11
yrs_married 1.219200e-02 1.219200e-02 1.219200e-02 1.219200e-02
[25]:
pd.DataFrame(
    np.column_stack([[r.llf, r.deviance, r.pearson_chi2] for r in results_all]),
    columns=names,
    index=["llf", "deviance", "pearson chi2"],
)
[25]:
res_o res_f res_e res_a
llf -10350.913296 -10350.913296 -740.748534 -5954.219866
deviance 15374.679054 15374.679054 967.455734 967.455734
皮爾森卡方 32310.704118 32310.704118 926.199428 926.199428

概似比類型檢定

我們在上面看到,聚合數據和原始個別數據之間的概似和相關統計數據不一致。我們在下面說明,概似比檢定和離差差異在不同版本之間一致,但是皮爾森卡方檢定不一致。

如同之前:這還不夠清楚,可能會改變。

作為測試案例,我們刪除 age 變數,並計算概似比類型統計量,作為簡化模型或受限模型與完整模型或非受限模型之間的差異。

原始觀測值和頻率權重

[26]:
glm = smf.glm(
    "affairs ~ rate_marriage + yrs_married", data=data, family=sm.families.Poisson()
)
res_o2 = glm.fit()
# print(res_f2.summary())
res_o2.pearson_chi2 - res_o.pearson_chi2, res_o2.deviance - res_o.deviance, res_o2.llf - res_o.llf
[26]:
(np.float64(52.91343161907935),
 np.float64(45.726693322505525),
 np.float64(-22.863346661251853))
[27]:
glm = smf.glm(
    "affairs ~ rate_marriage + yrs_married",
    data=dc,
    family=sm.families.Poisson(),
    freq_weights=np.asarray(dc["freq"]),
)
res_f2 = glm.fit()
# print(res_f2.summary())
res_f2.pearson_chi2 - res_f.pearson_chi2, res_f2.deviance - res_f.deviance, res_f2.llf - res_f.llf
[27]:
(np.float64(52.913431618650066),
 np.float64(45.726693322507344),
 np.float64(-22.863346661253672))

聚合數據:exposurevar_weights

注意:LR 檢定與原始觀測值一致,pearson_chi2 不同,且符號錯誤。

[28]:
glm = smf.glm(
    "affairs_sum ~ rate_marriage + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    exposure=np.asarray(df_a["affairs_count"]),
)
res_e2 = glm.fit()
res_e2.pearson_chi2 - res_e.pearson_chi2, res_e2.deviance - res_e.deviance, res_e2.llf - res_e.llf
[28]:
(np.float64(-31.618527525103445),
 np.float64(45.72669332250598),
 np.float64(-22.863346661252763))
[29]:
glm = smf.glm(
    "affairs_mean ~ rate_marriage + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    var_weights=np.asarray(df_a["affairs_count"]),
)
res_a2 = glm.fit()
res_a2.pearson_chi2 - res_a.pearson_chi2, res_a2.deviance - res_a.deviance, res_a2.llf - res_a.llf
[29]:
(np.float64(-31.61852752510788),
 np.float64(45.72669332250621),
 np.float64(-22.863346661252763))

研究皮爾森卡方統計量

首先,我們做一些健全性檢查,確認 pearson_chi2resid_pearson 的計算沒有基本錯誤。

[30]:
res_e2.pearson_chi2, res_e.pearson_chi2, (res_e2.resid_pearson ** 2).sum(), (
    res_e.resid_pearson ** 2
).sum()
[30]:
(np.float64(894.5809002315148),
 np.float64(926.1994277566182),
 np.float64(894.5809002315148),
 np.float64(926.199427756618))
[31]:
res_e._results.resid_response.mean(), res_e.model.family.variance(res_e.mu)[
    :5
], res_e.mu[:5]
[31]:
(np.float64(-2.815935518950797e-13),
 array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]),
 array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]))
[32]:
(res_e._results.resid_response ** 2 / res_e.model.family.variance(res_e.mu)).sum()
[32]:
np.float64(926.1994277566182)
[33]:
res_e2._results.resid_response.mean(), res_e2.model.family.variance(res_e2.mu)[
    :5
], res_e2.mu[:5]
[33]:
(np.float64(-2.361188168064333e-14),
 array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]),
 array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]))
[34]:
(res_e2._results.resid_response ** 2 / res_e2.model.family.variance(res_e2.mu)).sum()
[34]:
np.float64(894.5809002315148)
[35]:
(res_e2._results.resid_response ** 2).sum(), (res_e._results.resid_response ** 2).sum()
[35]:
(np.float64(51204.85737832321), np.float64(47104.64779595939))

其中一個導致符號錯誤的可能原因是,我們正在減去除以不同分母的二次項。在一些相關情況下,文獻中的建議是使用公分母。我們可以比較完整模型和簡化模型中使用相同變異數假設的皮爾森卡方統計量。

在這種情況下,我們在所有版本中都獲得了簡化模型和完整模型之間相同的皮爾森卡方縮放差異。(問題 #3616 旨在進一步追蹤此問題。)

[36]:
(
    (res_e2._results.resid_response ** 2 - res_e._results.resid_response ** 2)
    / res_e2.model.family.variance(res_e2.mu)
).sum()
[36]:
np.float64(44.43314175121899)
[37]:
(
    (res_a2._results.resid_response ** 2 - res_a._results.resid_response ** 2)
    / res_a2.model.family.variance(res_a2.mu)
    * res_a2.model.var_weights
).sum()
[37]:
np.float64(44.43314175121923)
[38]:
(
    (res_f2._results.resid_response ** 2 - res_f._results.resid_response ** 2)
    / res_f2.model.family.variance(res_f2.mu)
    * res_f2.model.freq_weights
).sum()
[38]:
np.float64(44.43314175122017)
[39]:
(
    (res_o2._results.resid_response ** 2 - res_o._results.resid_response ** 2)
    / res_o2.model.family.variance(res_o2.mu)
).sum()
[39]:
np.float64(44.43314175121962)

剩餘部分

筆記本的剩餘部分僅包含一些額外檢查,可以忽略。

[40]:
np.exp(res_e2.model.exposure)[:5], np.asarray(df_a["affairs_count"])[:5]
[40]:
(array([ 1., 10.,  5., 10.,  3.]), array([ 1, 10,  5, 10,  3]))
[41]:
res_e2.resid_pearson.sum() - res_e.resid_pearson.sum()
[41]:
np.float64(-9.664817945858474)
[42]:
res_e2.mu[:5]
[42]:
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538])
[43]:
res_a2.pearson_chi2, res_a.pearson_chi2, res_a2.resid_pearson.sum(), res_a.resid_pearson.sum()
[43]:
(np.float64(894.5809002315161),
 np.float64(926.199427756624),
 np.float64(-42.34720713518717),
 np.float64(-32.68238918932538))
[44]:
(
    (res_a2._results.resid_response ** 2)
    / res_a2.model.family.variance(res_a2.mu)
    * res_a2.model.var_weights
).sum()
[44]:
np.float64(894.5809002315161)
[45]:
(
    (res_a._results.resid_response ** 2)
    / res_a.model.family.variance(res_a.mu)
    * res_a.model.var_weights
).sum()
[45]:
np.float64(926.199427756624)
[46]:
(
    (res_a._results.resid_response ** 2)
    / res_a.model.family.variance(res_a2.mu)
    * res_a.model.var_weights
).sum()
[46]:
np.float64(850.1477584802967)
[47]:
res_e.model.endog[:5], res_e2.model.endog[:5]
[47]:
(array([ 0., 39., 17.,  9.,  4.]), array([ 0., 39., 17.,  9.,  4.]))
[48]:
res_a.model.endog[:5], res_a2.model.endog[:5]
[48]:
(array([0.        , 3.9       , 3.4       , 0.9       , 1.33333333]),
 array([0.        , 3.9       , 3.4       , 0.9       , 1.33333333]))
[49]:
res_a2.model.endog[:5] * np.exp(res_e2.model.exposure)[:5]
[49]:
array([ 0., 39., 17.,  9.,  4.])
[50]:
res_a2.model.endog[:5] * res_a2.model.var_weights[:5]
[50]:
array([ 0., 39., 17.,  9.,  4.])
[51]:
from scipy import stats

stats.chi2.sf(27.19530754604785, 1), stats.chi2.sf(29.083798806764687, 1)
[51]:
(np.float64(1.8390448369994542e-07), np.float64(6.931421143170174e-08))
[52]:
res_o.pvalues
[52]:
Intercept        3.756282e-141
rate_marriage     0.000000e+00
age               2.221918e-11
yrs_married       1.219200e-02
dtype: float64
[53]:
print(res_e2.summary())
print(res_e.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            affairs_sum   No. Observations:                  130
Model:                            GLM   Df Residuals:                      127
Model Family:                 Poisson   Df Model:                            2
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -763.61
Date:                Thu, 03 Oct 2024   Deviance:                       1013.2
Time:                        15:45:08   Pearson chi2:                     895.
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.0754      0.050     41.512      0.000       1.977       2.173
rate_marriage    -0.4947      0.012    -41.743      0.000      -0.518      -0.471
yrs_married      -0.0360      0.002    -17.542      0.000      -0.040      -0.032
=================================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            affairs_sum   No. Observations:                  130
Model:                            GLM   Df Residuals:                      126
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -740.75
Date:                Thu, 03 Oct 2024   Deviance:                       967.46
Time:                        15:45:08   Pearson chi2:                     926.
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[54]:
print(res_f2.summary())
print(res_f.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                     6363
Model Family:                 Poisson   Df Model:                            2
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10374.
Date:                Thu, 03 Oct 2024   Deviance:                       15420.
Time:                        15:45:08   Pearson chi2:                 3.24e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9729
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.0754      0.050     41.512      0.000       1.977       2.173
rate_marriage    -0.4947      0.012    -41.743      0.000      -0.518      -0.471
yrs_married      -0.0360      0.002    -17.542      0.000      -0.040      -0.032
=================================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                     6362
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Thu, 03 Oct 2024   Deviance:                       15375.
Time:                        15:45:08   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9754
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================

最後更新:2024 年 10 月 03 日