The Datasets Package

statsmodels provides datasets (that is, data *and* metadata) for use in examples, tutorials, model testing, and so on.

Using Datasets from Stata

`webuse`(data[, baseurl, as_df])	Download and return an example dataset from Stata.

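Using Datasets from R

The Rdatasets project gives access to the datasets available in R's core datasets package and in many other common R packages. All of these datasets are available to statsmodels through the get_rdataset function. The actual data is accessible through the data attribute. For example: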

In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "carData")

In [3]: print(duncan_prestige.__doc__)
.. container::

   .. container::

      ====== ===============
      Duncan R Documentation
      ====== ===============

      .. rubric:: Duncan's Occupational Prestige Data
         :name: duncans-occupational-prestige-data

      .. rubric:: Description
         :name: description

      The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
      prestige and other characteristics of 45 U. S. occupations in
      1950.

      .. rubric:: Usage
         :name: usage

      .. code:: R

         Duncan

      .. rubric:: Format
         :name: format

      This data frame contains the following columns:

      type
         Type of occupation. A factor with the following levels:
         ``prof``, professional and managerial; ``wc``, white-collar;
         ``bc``, blue-collar.

      income
         Percentage of occupational incumbents in the 1950 US Census who
         earned $3,500 or more per year (about $36,000 in 2017 US
         dollars).

      education
         Percentage of occupational incumbents in 1950 who were high
         school graduates (which, were we cynical, we would say is
         roughly equivalent to a PhD in 2017)

      prestige
         Percentage of respondents in a social survey who rated the
         occupation as “good” or better in prestige

      .. rubric:: Source
         :name: source

      Duncan, O. D. (1961) A socioeconomic index for all occupations. In
      Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free
      Press [Table VI-1].

      .. rubric:: References
         :name: references

      Fox, J. (2016) *Applied Regression Analysis and Generalized Linear
      Models*, Third Edition. Sage.

      Fox, J. and Weisberg, S. (2019) *An R Companion to Applied
      Regression*, Third Edition, Sage.


In [4]: duncan_prestige.data.head(5)
Out[4]: 
            type  income  education  prestige
rownames                                     
accountant  prof      62         86        82
pilot       prof      72         76        83
architect   prof      75         92        90
author      prof      55         90        76
chemist     prof      64         86        90

R Datasets Function Reference

`get_rdataset`(dataname[, package, cache])	Download and return an R dataset.
`get_data_home`([data_home])	Return the path of the statsmodels data directory.
`clear_data_home`([data_home])	Delete all the content of the data home cache.

Available Datasets

Usage

Load a dataset:

In [5]: import statsmodels.api as sm

In [6]: data = sm.datasets.longley.load_pandas()

The Dataset object follows the bunch pattern. The full dataset is available in the data attribute.

In [7]: data.data
Out[7]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

Most datasets hold convenient representations of the data in the endog and exog attributes:

In [8]: data.endog.iloc[:5]
Out[8]: 
0    60323.0
1    61122.0
2    60171.0
3    61187.0
4    63221.0
Name: TOTEMP, dtype: float64

In [9]: data.exog.iloc[:5,:]
Out[9]: 
   GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4     96.2  328975.0  2099.0  3099.0  112075.0  1951.0

Univariate datasets, however, do not have an exog attribute.

Variable names can be obtained by typing:

In [10]: data.endog_name
Out[10]: 'TOTEMP'

In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

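If a dataset has no clear interpretation of what should be endog and exog, you can always access the data or raw_data attributes directly. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains the full dataset, and the raw_data attribute contains the underlying data, with column names given by the names attribute. In this session, where the data was loaded with load_pandas, both attributes are pandas DataFrames: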

In [12]: type(data.data)
Out[12]: pandas.core.frame.DataFrame

In [13]: type(data.raw_data)
Out[13]: pandas.core.frame.DataFrame

In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

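Loading data as pandas objects

For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each dataset module is equipped with a load_pandas method, which returns a Dataset instance with the data readily available as pandas objects: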

In [15]: data = sm.datasets.longley.load_pandas()

In [16]: data.exog
Out[16]: 
    GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0      83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1      88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2      88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3      89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4      96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5      98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6      99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7     100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8     101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9     104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

In [17]: data.endog
Out[17]: 
0     60323.0
1     61122.0
2     60171.0
3     61187.0
4     63221.0
5     63639.0
6     64989.0
7     63761.0
8     66019.0
9     67857.0
10    68169.0
11    66513.0
12    68655.0
13    69564.0
14    69331.0
15    70551.0
Name: TOTEMP, dtype: float64

The full DataFrame is available in the data attribute of the Dataset object:

In [18]: data.data
Out[18]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

With pandas integration in the estimation classes, the metadata is attached to the model results:

In [19]: y, x = data.endog, data.exog

In [20]: res = sm.OLS(y, x).fit()

In [21]: res.params
Out[21]: 
GNPDEFL   -52.993570
GNP         0.071073
UNEMP      -0.423466
ARMED      -0.572569
POP        -0.414204
YEAR       48.417866
dtype: float64

In [22]: res.summary()
Out[22]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                 TOTEMP   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          5.052e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):                    8.20e-22
Time:                        16:08:41   Log-Likelihood:                         -117.56
No. Observations:                  16   AIC:                                      247.1
Df Residuals:                      10   BIC:                                      251.8
Df Model:                           6                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
GNPDEFL      -52.9936    129.545     -0.409      0.691    -341.638     235.650
GNP            0.0711      0.030      2.356      0.040       0.004       0.138
UNEMP         -0.4235      0.418     -1.014      0.335      -1.354       0.507
ARMED         -0.5726      0.279     -2.052      0.067      -1.194       0.049
POP           -0.4142      0.321     -1.289      0.226      -1.130       0.302
YEAR          48.4179     17.689      2.737      0.021       9.003      87.832
==============================================================================
Omnibus:                        1.443   Durbin-Watson:                   1.277
Prob(Omnibus):                  0.486   Jarque-Bera (JB):                0.605
Skew:                           0.476   Prob(JB):                        0.739
Kurtosis:                       3.031   Cond. No.                     4.56e+05
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 4.56e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

Extra Information

If you want to know more about the dataset itself, you can access the following, again using the Longley dataset as an example:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']

Additional information

The idea for a datasets package was originally proposed by David Cournapeau.
To add datasets, see the notes on adding a dataset.

Last update: Oct 03, 2024