加權廣義線性模型¶
[1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
加權 GLM:泊松反應數據¶
載入數據¶
在這個範例中,我們將使用婚外情數據集,使用一些外生變數來預測婚外情發生率。
將產生權重來顯示 freq_weights
等同於重複的數據記錄。另一方面,var_weights
等同於聚合數據。
[2]:
print(sm.datasets.fair.NOTE)
::
Number of observations: 6366
Number of variables: 9
Variable name definitions:
rate_marriage : How rate marriage, 1 = very poor, 2 = poor, 3 = fair,
4 = good, 5 = very good
age : Age
yrs_married : No. years married. Interval approximations. See
original paper for detailed explanation.
children : No. children
religious : How relgious, 1 = not, 2 = mildly, 3 = fairly,
4 = strongly
educ : Level of education, 9 = grade school, 12 = high
school, 14 = some college, 16 = college graduate,
17 = some graduate school, 20 = advanced degree
occupation : 1 = student, 2 = farming, agriculture; semi-skilled,
or unskilled worker; 3 = white-colloar; 4 = teacher
counselor social worker, nurse; artist, writers;
technician, skilled worker, 5 = managerial,
administrative, business, 6 = professional with
advanced degree
occupation_husb : Husband's occupation. Same as occupation.
affairs : measure of time spent in extramarital affairs
See the original paper for more details.
將數據載入 pandas dataframe。
[3]:
data = sm.datasets.fair.load_pandas().data
相依(內生)變數為 affairs
[4]:
data.describe()
[4]:
rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | |
---|---|---|---|---|---|---|---|---|---|
count | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 |
mean | 4.109645 | 29.082862 | 9.009425 | 1.396874 | 2.426170 | 14.209865 | 3.424128 | 3.850141 | 0.705374 |
std | 0.961430 | 6.847882 | 7.280120 | 1.433471 | 0.878369 | 2.178003 | 0.942399 | 1.346435 | 2.203374 |
min | 1.000000 | 17.500000 | 0.500000 | 0.000000 | 1.000000 | 9.000000 | 1.000000 | 1.000000 | 0.000000 |
25% | 4.000000 | 22.000000 | 2.500000 | 0.000000 | 2.000000 | 12.000000 | 3.000000 | 3.000000 | 0.000000 |
50% | 4.000000 | 27.000000 | 6.000000 | 1.000000 | 2.000000 | 14.000000 | 3.000000 | 4.000000 | 0.000000 |
75% | 5.000000 | 32.000000 | 16.500000 | 2.000000 | 3.000000 | 16.000000 | 4.000000 | 5.000000 | 0.484848 |
max | 5.000000 | 42.000000 | 23.000000 | 5.500000 | 4.000000 | 20.000000 | 6.000000 | 6.000000 | 57.599991 |
[5]:
data[:3]
[5]:
rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | |
---|---|---|---|---|---|---|---|---|---|
0 | 3.0 | 32.0 | 9.0 | 3.0 | 3.0 | 17.0 | 2.0 | 5.0 | 0.111111 |
1 | 3.0 | 27.0 | 13.0 | 3.0 | 1.0 | 14.0 | 3.0 | 4.0 | 3.230769 |
2 | 4.0 | 22.0 | 2.5 | 0.0 | 1.0 | 16.0 | 3.0 | 5.0 | 1.400000 |
在接下來的步驟中,我們主要使用泊松。雖然使用小數的 affairs 可以運作,但我們將它們轉換為整數以獲得計數分佈。
[6]:
data["affairs"] = np.ceil(data["affairs"])
data[:3]
[6]:
rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | |
---|---|---|---|---|---|---|---|---|---|
0 | 3.0 | 32.0 | 9.0 | 3.0 | 3.0 | 17.0 | 2.0 | 5.0 | 1.0 |
1 | 3.0 | 27.0 | 13.0 | 3.0 | 1.0 | 14.0 | 3.0 | 4.0 | 4.0 |
2 | 4.0 | 22.0 | 2.5 | 0.0 | 1.0 | 16.0 | 3.0 | 5.0 | 2.0 |
[7]:
(data["affairs"] == 0).mean()
[7]:
np.float64(0.6775054979579014)
[8]:
np.bincount(data["affairs"].astype(int))
[8]:
array([4313, 934, 488, 180, 130, 172, 7, 21, 67, 2, 0,
0, 17, 0, 0, 0, 3, 12, 8, 0, 0, 0,
0, 0, 2, 2, 2, 3, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1])
壓縮和聚合觀測值¶
我們的原始數據集中有 6366 個觀測值。當我們只考慮一些選定的變數時,我們擁有的唯一觀測值會更少。在接下來的步驟中,我們以兩種方式組合觀測值,首先我們組合所有變數值都相同的觀測值,其次我們組合具有相同解釋變數的觀測值。
具有唯一觀測值的數據集¶
我們使用 pandas 的 groupby 來組合相同的觀測值,並建立一個新的變數 freq
來計算有多少個觀測值具有對應行的值。
[9]:
data2 = data.copy()
data2["const"] = 1
dc = (
data2["affairs rate_marriage age yrs_married const".split()]
.groupby("affairs rate_marriage age yrs_married".split())
.count()
)
dc.reset_index(inplace=True)
dc.rename(columns={"const": "freq"}, inplace=True)
print(dc.shape)
dc.head()
(476, 5)
[9]:
affairs | rate_marriage | age | yrs_married | freq | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 17.5 | 0.5 | 1 |
1 | 0.0 | 1.0 | 22.0 | 2.5 | 3 |
2 | 0.0 | 1.0 | 27.0 | 2.5 | 1 |
3 | 0.0 | 1.0 | 27.0 | 6.0 | 5 |
4 | 0.0 | 1.0 | 27.0 | 9.0 | 1 |
具有唯一解釋變數的數據集 (exog)¶
對於下一個數據集,我們組合具有相同解釋變數值的觀測值。但是,由於反應變數在組合的觀測值之間可能不同,因此我們計算所有組合觀測值的反應變數的平均值和總和。
我們再次使用 pandas groupby
來組合觀測值並建立新的變數。我們也將 MultiIndex
展平為簡單索引。
[10]:
gr = data["affairs rate_marriage age yrs_married".split()].groupby(
"rate_marriage age yrs_married".split()
)
df_a = gr.agg(["mean", "sum", "count"])
def merge_tuple(tpl):
if isinstance(tpl, tuple) and len(tpl) > 1:
return "_".join(map(str, tpl))
else:
return tpl
df_a.columns = df_a.columns.map(merge_tuple)
df_a.reset_index(inplace=True)
print(df_a.shape)
df_a.head()
(130, 6)
[10]:
rate_marriage | age | yrs_married | affairs_mean | affairs_sum | affairs_count | |
---|---|---|---|---|---|---|
0 | 1.0 | 17.5 | 0.5 | 0.000000 | 0.0 | 1 |
1 | 1.0 | 22.0 | 2.5 | 3.900000 | 39.0 | 10 |
2 | 1.0 | 27.0 | 2.5 | 3.400000 | 17.0 | 5 |
3 | 1.0 | 27.0 | 6.0 | 0.900000 | 9.0 | 10 |
4 | 1.0 | 27.0 | 9.0 | 1.333333 | 4.0 | 3 |
在組合觀測值後,我們有一個具有 467 個唯一觀測值的數據框 dc
和一個具有 130 個解釋變數唯一值的觀測值的數據框 df_a
。
[11]:
print("number of rows: \noriginal, with unique observations, with unique exog")
data.shape[0], dc.shape[0], df_a.shape[0]
number of rows:
original, with unique observations, with unique exog
[11]:
(6366, 476, 130)
分析¶
在接下來的步驟中,我們將原始數據的 GLM 泊松結果與組合觀測值的模型進行比較,其中多重性或聚合由權重或 exposure 給出。
原始數據¶
[12]:
glm = smf.glm(
"affairs ~ rate_marriage + age + yrs_married",
data=data,
family=sm.families.Poisson(),
)
res_o = glm.fit()
print(res_o.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 6366
Model: GLM Df Residuals: 6362
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Thu, 03 Oct 2024 Deviance: 15375.
Time: 15:45:07 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.2420
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[13]:
res_o.pearson_chi2 / res_o.df_resid
[13]:
np.float64(5.078702313363214)
壓縮數據(具有頻率的唯一觀測值)¶
結合相同的觀測值並使用頻率權重來考慮觀測值的多重性會產生完全相同的結果。當我們想要有關觀測值而非有關所有相同觀測值的總和的資訊時,某些結果屬性會有所不同。例如,殘差不考慮 freq_weights
。
[14]:
glm = smf.glm(
"affairs ~ rate_marriage + age + yrs_married",
data=dc,
family=sm.families.Poisson(),
freq_weights=np.asarray(dc["freq"]),
)
res_f = glm.fit()
print(res_f.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 6362
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Thu, 03 Oct 2024 Deviance: 15375.
Time: 15:45:07 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9754
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[15]:
res_f.pearson_chi2 / res_f.df_resid
[15]:
np.float64(5.078702313363196)
使用 var_weights
而非 freq_weights
壓縮¶
接下來,我們將 var_weights
與 freq_weights
進行比較。當內生變數反映平均值而不是相同的觀測值時,將 var_weights
納入是一種常見的做法。我不認為有理論上的理由說明為什麼它會產生相同的結果(一般而言)。
這會產生相同的結果,但 df_resid
與 freq_weights
範例不同,因為 var_weights
不會改變有效觀測值的數量。
[16]:
glm = smf.glm(
"affairs ~ rate_marriage + age + yrs_married",
data=dc,
family=sm.families.Poisson(),
var_weights=np.asarray(dc["freq"]),
)
res_fv = glm.fit()
print(res_fv.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 472
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Thu, 03 Oct 2024 Deviance: 15375.
Time: 15:45:07 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9754
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
從結果計算得出的離散度不正確,因為 df_resid
錯誤。如果我們使用原始的 df_resid
,它是正確的。
[17]:
res_fv.pearson_chi2 / res_fv.df_resid, res_f.pearson_chi2 / res_f.df_resid
[17]:
(np.float64(68.45488160512002), np.float64(5.078702313363196))
聚合或平均數據(解釋變數的唯一值)¶
對於這些情況,我們結合具有相同解釋變數值的觀測值。相應的反應變數是總和或平均值。
使用 exposure
¶
如果我們的相依變數是所有組合觀測值的反應總和,那麼在泊松假設下,分佈保持不變,但我們的 exposure
會根據由一個聚合觀測值表示的個體數量而變化。
參數估計和參數共變異數與原始數據相同,但對數概似、離差和皮爾森卡方不同
[18]:
glm = smf.glm(
"affairs_sum ~ rate_marriage + age + yrs_married",
data=df_a,
family=sm.families.Poisson(),
exposure=np.asarray(df_a["affairs_count"]),
)
res_e = glm.fit()
print(res_e.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_sum No. Observations: 130
Model: GLM Df Residuals: 126
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -740.75
Date: Thu, 03 Oct 2024 Deviance: 967.46
Time: 15:45:07 Pearson chi2: 926.
No. Iterations: 6 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[19]:
res_e.pearson_chi2 / res_e.df_resid
[19]:
np.float64(7.35078910917951)
使用 var_weights¶
我們也可以使用相依變數所有組合值的平均值。在這種情況下,變異數將與一個組合觀測值反映的總 exposure 的倒數相關。
[20]:
glm = smf.glm(
"affairs_mean ~ rate_marriage + age + yrs_married",
data=df_a,
family=sm.families.Poisson(),
var_weights=np.asarray(df_a["affairs_count"]),
)
res_a = glm.fit()
print(res_a.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_mean No. Observations: 130
Model: GLM Df Residuals: 126
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -5954.2
Date: Thu, 03 Oct 2024 Deviance: 967.46
Time: 15:45:07 Pearson chi2: 926.
No. Iterations: 5 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
比較¶
我們在上面的摘要列印中看到,params
和 cov_params
與相關的 Wald 推論在不同版本之間一致。我們在下面總結這一點,比較不同版本之間的個別結果屬性。
參數估計值 params
、參數的標準誤差 bse
和參數的 pvalues
(用於檢定參數是否為零)都一致。但是,概似和擬合優度統計資料 llf
、deviance
和 pearson_chi2
僅部分一致。具體而言,聚合版本與使用原始數據的結果不一致。
警告:llf
、deviance
和 pearson_chi2
的行為在未來版本中可能仍會變更。
對於解釋變數的唯一值,反應變數的總和與平均值都具有適當的概似解釋。但是,此解釋並未反映在這三個統計資料中。計算上,這可能是由於使用聚合數據時缺少調整所致。但是,理論上我們可以思考在這些情況下,特別是當概似分析不適用時,對於 var_weights
,結果應解釋為準概似估計。 var_weights
的定義存在歧義,因為它們可用於具有正確指定概似的平均值,以及用於準概似情況下的變異數調整。我們目前不嘗試匹配概似規格。但是,在下一節中,我們將顯示,當我們假設基礎模型已正確指定時,對於所有聚合版本,概似比類型檢定仍會產生相同的結果。
[21]:
results_all = [res_o, res_f, res_e, res_a]
names = "res_o res_f res_e res_a".split()
[22]:
pd.concat([r.params for r in results_all], axis=1, keys=names)
[22]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
截距 | 2.715533 | 2.715533 | 2.715533 | 2.715533 |
rate_marriage | -0.495180 | -0.495180 | -0.495180 | -0.495180 |
age | -0.029914 | -0.029914 | -0.029914 | -0.029914 |
yrs_married | -0.010763 | -0.010763 | -0.010763 | -0.010763 |
[23]:
pd.concat([r.bse for r in results_all], axis=1, keys=names)
[23]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
截距 | 0.107360 | 0.107360 | 0.107360 | 0.107360 |
rate_marriage | 0.011874 | 0.011874 | 0.011874 | 0.011874 |
age | 0.004471 | 0.004471 | 0.004471 | 0.004471 |
yrs_married | 0.004294 | 0.004294 | 0.004294 | 0.004294 |
[24]:
pd.concat([r.pvalues for r in results_all], axis=1, keys=names)
[24]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
截距 | 3.756282e-141 | 3.756280e-141 | 3.756282e-141 | 3.756282e-141 |
rate_marriage | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
age | 2.221918e-11 | 2.221918e-11 | 2.221918e-11 | 2.221918e-11 |
yrs_married | 1.219200e-02 | 1.219200e-02 | 1.219200e-02 | 1.219200e-02 |
[25]:
pd.DataFrame(
np.column_stack([[r.llf, r.deviance, r.pearson_chi2] for r in results_all]),
columns=names,
index=["llf", "deviance", "pearson chi2"],
)
[25]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
llf | -10350.913296 | -10350.913296 | -740.748534 | -5954.219866 |
deviance | 15374.679054 | 15374.679054 | 967.455734 | 967.455734 |
皮爾森卡方 | 32310.704118 | 32310.704118 | 926.199428 | 926.199428 |
概似比類型檢定¶
我們在上面看到,聚合數據和原始個別數據之間的概似和相關統計數據不一致。我們在下面說明,概似比檢定和離差差異在不同版本之間一致,但是皮爾森卡方檢定不一致。
如同之前:這還不夠清楚,可能會改變。
作為測試案例,我們刪除 age
變數,並計算概似比類型統計量,作為簡化模型或受限模型與完整模型或非受限模型之間的差異。
原始觀測值和頻率權重¶
[26]:
glm = smf.glm(
"affairs ~ rate_marriage + yrs_married", data=data, family=sm.families.Poisson()
)
res_o2 = glm.fit()
# print(res_f2.summary())
res_o2.pearson_chi2 - res_o.pearson_chi2, res_o2.deviance - res_o.deviance, res_o2.llf - res_o.llf
[26]:
(np.float64(52.91343161907935),
np.float64(45.726693322505525),
np.float64(-22.863346661251853))
[27]:
glm = smf.glm(
"affairs ~ rate_marriage + yrs_married",
data=dc,
family=sm.families.Poisson(),
freq_weights=np.asarray(dc["freq"]),
)
res_f2 = glm.fit()
# print(res_f2.summary())
res_f2.pearson_chi2 - res_f.pearson_chi2, res_f2.deviance - res_f.deviance, res_f2.llf - res_f.llf
[27]:
(np.float64(52.913431618650066),
np.float64(45.726693322507344),
np.float64(-22.863346661253672))
聚合數據:exposure
和 var_weights
¶
注意:LR 檢定與原始觀測值一致,pearson_chi2
不同,且符號錯誤。
[28]:
glm = smf.glm(
"affairs_sum ~ rate_marriage + yrs_married",
data=df_a,
family=sm.families.Poisson(),
exposure=np.asarray(df_a["affairs_count"]),
)
res_e2 = glm.fit()
res_e2.pearson_chi2 - res_e.pearson_chi2, res_e2.deviance - res_e.deviance, res_e2.llf - res_e.llf
[28]:
(np.float64(-31.618527525103445),
np.float64(45.72669332250598),
np.float64(-22.863346661252763))
[29]:
glm = smf.glm(
"affairs_mean ~ rate_marriage + yrs_married",
data=df_a,
family=sm.families.Poisson(),
var_weights=np.asarray(df_a["affairs_count"]),
)
res_a2 = glm.fit()
res_a2.pearson_chi2 - res_a.pearson_chi2, res_a2.deviance - res_a.deviance, res_a2.llf - res_a.llf
[29]:
(np.float64(-31.61852752510788),
np.float64(45.72669332250621),
np.float64(-22.863346661252763))
研究皮爾森卡方統計量¶
首先,我們做一些健全性檢查,確認 pearson_chi2
和 resid_pearson
的計算沒有基本錯誤。
[30]:
res_e2.pearson_chi2, res_e.pearson_chi2, (res_e2.resid_pearson ** 2).sum(), (
res_e.resid_pearson ** 2
).sum()
[30]:
(np.float64(894.5809002315148),
np.float64(926.1994277566182),
np.float64(894.5809002315148),
np.float64(926.199427756618))
[31]:
res_e._results.resid_response.mean(), res_e.model.family.variance(res_e.mu)[
:5
], res_e.mu[:5]
[31]:
(np.float64(-2.815935518950797e-13),
array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]),
array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]))
[32]:
(res_e._results.resid_response ** 2 / res_e.model.family.variance(res_e.mu)).sum()
[32]:
np.float64(926.1994277566182)
[33]:
res_e2._results.resid_response.mean(), res_e2.model.family.variance(res_e2.mu)[
:5
], res_e2.mu[:5]
[33]:
(np.float64(-2.361188168064333e-14),
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]),
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]))
[34]:
(res_e2._results.resid_response ** 2 / res_e2.model.family.variance(res_e2.mu)).sum()
[34]:
np.float64(894.5809002315148)
[35]:
(res_e2._results.resid_response ** 2).sum(), (res_e._results.resid_response ** 2).sum()
[35]:
(np.float64(51204.85737832321), np.float64(47104.64779595939))
其中一個導致符號錯誤的可能原因是,我們正在減去除以不同分母的二次項。在一些相關情況下,文獻中的建議是使用公分母。我們可以比較完整模型和簡化模型中使用相同變異數假設的皮爾森卡方統計量。
在這種情況下,我們在所有版本中都獲得了簡化模型和完整模型之間相同的皮爾森卡方縮放差異。(問題 #3616 旨在進一步追蹤此問題。)
[36]:
(
(res_e2._results.resid_response ** 2 - res_e._results.resid_response ** 2)
/ res_e2.model.family.variance(res_e2.mu)
).sum()
[36]:
np.float64(44.43314175121899)
[37]:
(
(res_a2._results.resid_response ** 2 - res_a._results.resid_response ** 2)
/ res_a2.model.family.variance(res_a2.mu)
* res_a2.model.var_weights
).sum()
[37]:
np.float64(44.43314175121923)
[38]:
(
(res_f2._results.resid_response ** 2 - res_f._results.resid_response ** 2)
/ res_f2.model.family.variance(res_f2.mu)
* res_f2.model.freq_weights
).sum()
[38]:
np.float64(44.43314175122017)
[39]:
(
(res_o2._results.resid_response ** 2 - res_o._results.resid_response ** 2)
/ res_o2.model.family.variance(res_o2.mu)
).sum()
[39]:
np.float64(44.43314175121962)
剩餘部分¶
筆記本的剩餘部分僅包含一些額外檢查,可以忽略。
[40]:
np.exp(res_e2.model.exposure)[:5], np.asarray(df_a["affairs_count"])[:5]
[40]:
(array([ 1., 10., 5., 10., 3.]), array([ 1, 10, 5, 10, 3]))
[41]:
res_e2.resid_pearson.sum() - res_e.resid_pearson.sum()
[41]:
np.float64(-9.664817945858474)
[42]:
res_e2.mu[:5]
[42]:
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538])
[43]:
res_a2.pearson_chi2, res_a.pearson_chi2, res_a2.resid_pearson.sum(), res_a.resid_pearson.sum()
[43]:
(np.float64(894.5809002315161),
np.float64(926.199427756624),
np.float64(-42.34720713518717),
np.float64(-32.68238918932538))
[44]:
(
(res_a2._results.resid_response ** 2)
/ res_a2.model.family.variance(res_a2.mu)
* res_a2.model.var_weights
).sum()
[44]:
np.float64(894.5809002315161)
[45]:
(
(res_a._results.resid_response ** 2)
/ res_a.model.family.variance(res_a.mu)
* res_a.model.var_weights
).sum()
[45]:
np.float64(926.199427756624)
[46]:
(
(res_a._results.resid_response ** 2)
/ res_a.model.family.variance(res_a2.mu)
* res_a.model.var_weights
).sum()
[46]:
np.float64(850.1477584802967)
[47]:
res_e.model.endog[:5], res_e2.model.endog[:5]
[47]:
(array([ 0., 39., 17., 9., 4.]), array([ 0., 39., 17., 9., 4.]))
[48]:
res_a.model.endog[:5], res_a2.model.endog[:5]
[48]:
(array([0. , 3.9 , 3.4 , 0.9 , 1.33333333]),
array([0. , 3.9 , 3.4 , 0.9 , 1.33333333]))
[49]:
res_a2.model.endog[:5] * np.exp(res_e2.model.exposure)[:5]
[49]:
array([ 0., 39., 17., 9., 4.])
[50]:
res_a2.model.endog[:5] * res_a2.model.var_weights[:5]
[50]:
array([ 0., 39., 17., 9., 4.])
[51]:
from scipy import stats
stats.chi2.sf(27.19530754604785, 1), stats.chi2.sf(29.083798806764687, 1)
[51]:
(np.float64(1.8390448369994542e-07), np.float64(6.931421143170174e-08))
[52]:
res_o.pvalues
[52]:
Intercept 3.756282e-141
rate_marriage 0.000000e+00
age 2.221918e-11
yrs_married 1.219200e-02
dtype: float64
[53]:
print(res_e2.summary())
print(res_e.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_sum No. Observations: 130
Model: GLM Df Residuals: 127
Model Family: Poisson Df Model: 2
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -763.61
Date: Thu, 03 Oct 2024 Deviance: 1013.2
Time: 15:45:08 Pearson chi2: 895.
No. Iterations: 6 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.0754 0.050 41.512 0.000 1.977 2.173
rate_marriage -0.4947 0.012 -41.743 0.000 -0.518 -0.471
yrs_married -0.0360 0.002 -17.542 0.000 -0.040 -0.032
=================================================================================
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_sum No. Observations: 130
Model: GLM Df Residuals: 126
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -740.75
Date: Thu, 03 Oct 2024 Deviance: 967.46
Time: 15:45:08 Pearson chi2: 926.
No. Iterations: 6 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[54]:
print(res_f2.summary())
print(res_f.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 6363
Model Family: Poisson Df Model: 2
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10374.
Date: Thu, 03 Oct 2024 Deviance: 15420.
Time: 15:45:08 Pearson chi2: 3.24e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9729
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.0754 0.050 41.512 0.000 1.977 2.173
rate_marriage -0.4947 0.012 -41.743 0.000 -0.518 -0.471
yrs_married -0.0360 0.002 -17.542 0.000 -0.040 -0.032
=================================================================================
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 6362
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Thu, 03 Oct 2024 Deviance: 15375.
Time: 15:45:08 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9754
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================