xtsum is an R wrapper modeled on Stata's xtsum command; it provides summary statistics for a panel data set. It decomposes the variable \(x_{it}\) into a between component \((\bar{x}_i)\) and a within component \((x_{it} - \bar{x}_i + \bar{\bar{x}})\), with the global mean \(\bar{\bar{x}}\) added back in to make the results comparable; see (StataCorp 2023).
This function computes summary statistics for panel data, including overall statistics, between-group statistics, and within-group statistics.
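The decomposition can be written out explicitly. The following is a sketch under the assumption that individual \(i\) is observed in \(T_i\) periods, that there are \(n\) individuals, and that \(N = \sum_i T_i\) is the total number of observations; the symbols \(T_i\), \(n\), and \(N\) are introduced here for illustration.

\[
\bar{x}_i = \frac{1}{T_i}\sum_{t=1}^{T_i} x_{it}, \qquad
\bar{\bar{x}} = \frac{1}{N}\sum_{i=1}^{n}\sum_{t=1}^{T_i} x_{it}, \qquad
x_{it} = \underbrace{\bar{x}_i}_{\text{between}} + \underbrace{\left(x_{it} - \bar{x}_i + \bar{\bar{x}}\right)}_{\text{within}} - \bar{\bar{x}}
\]

Because the deviations \(x_{it} - \bar{x}_i\) sum to zero within each individual, the within component has mean \(\bar{\bar{x}}\), which places it on the same scale as the original variable.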
Usage
xtsum(
data,
variables = NULL,
id = NULL,
t = NULL,
na.rm = FALSE,
return.data.frame = TRUE,
dec = 3
)
Arguments
data
A data.frame or pdata.frame object representing
panel data.
variables
(Optional) Vector of variable names for
which to calculate statistics. If not provided, all numeric variables in
the data will be used.
id
(Optional) Name of the individual identifier
variable.
t
(Optional) Name of the time identifier
variable.
na.rm
Logical indicating whether to remove NAs when
calculating statistics.
return.data.frame
Logical indicating whether the result should be
returned as a data.frame.
dec
Number of significant digits to report
Based on National Longitudinal Survey of Young Women, 14-24 years old in 1968
data("nlswork", package = "sampleSelection")
xtsum(nlswork, "hours", id = "idcode", t = "year", na.rm = TRUE, dec = 6)
| Variable | Dim | Mean | SD | Min | Max | Observations |
|---|---|---|---|---|---|---|
| ___________ | _________ | | | | | |
| hours | overall | 36.55956 | 9.869623 | 1 | 168 | N = 28467 |
| | between | | 7.846585 | 1 | 83.5 | n = 4710 |
| | within | | 7.520712 | -2.154726 | 130.05956 | T = 6.043949 |
The table above can be interpreted as follows, paraphrasing (StataCorp 2023).
The overall and within statistics are calculated over N = 28,467 person-years of data. The between statistics are calculated over n = 4,710 persons, and the average number of years a person was observed in the hours data is T = 6.
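The three counts in the Observations column are linked by simple arithmetic: the reported T matches the number of person-years divided by the number of persons. A minimal check in base R, using only the figures from the table above (the variable names are illustrative):

```r
# Figures copied from the Observations column of the table above.
N <- 28467  # person-years (overall and within)
n <- 4710   # persons (between)

# Average number of years observed per person.
T_bar <- N / n
round(T_bar, 6)  # 6.043949, the T reported in the within row
```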
xtsum also reports standard deviations (SD), minimums (Min), and maximums (Max).
Hours worked varied between overall Min = 1 and overall Max = 168. Average hours worked for each woman ranged from between Min = 1 to between Max = 83.5. “Hours worked within” varied between within Min = −2.15 and within Max = 130.1, which is not to say that any woman actually worked negative hours. The within number refers to the deviation from each individual’s average, and naturally, some of those deviations must be negative. The negative value is therefore not disturbing, but the positive value is. Did some woman really deviate from her average by +130.1 hours? No. In our definition of within, we add back in the global average of 36.6 hours. Some woman did deviate from her average by 130.1 − 36.6 = 93.5 hours, which is still large.
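The effect of adding the global mean back in can be reproduced on a small example. The sketch below uses base R and a made-up toy panel (the data and the column names id, year, and hours are assumptions for illustration); it is not the package's internal code.

```r
# Toy panel: 3 individuals observed over 2-3 periods (made-up numbers).
panel <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 3, 3),
  year  = c(1, 2, 3, 1, 2, 1, 2, 3),
  hours = c(40, 38, 42, 20, 60, 35, 35, 38)
)

global_mean <- mean(panel$hours)                       # global mean of hours
ind_mean    <- ave(panel$hours, panel$id, FUN = mean)  # individual mean, repeated per row
within_x    <- panel$hours - ind_mean + global_mean    # within-transformed values

range(panel$hours)  # overall Min and Max
range(within_x)     # within Min and Max, centered on the global mean

# Subtracting the global mean recovers the largest deviation of any
# individual from her own average (here 60 - 40 for individual 2).
max(within_x) - global_mean  # 20
```

The within Max is therefore not itself a deviation; the deviation is obtained only after subtracting the global mean, exactly as in the 130.1 − 36.6 calculation above.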
The reported standard deviations tell us that the variation in hours worked last week across women is nearly equal to that observed within a woman over time. That is, if you were to draw two women randomly from our data, the difference in hours worked is expected to be nearly equal to the difference for the same woman in two randomly selected years.
A more detailed interpretation can be found in the handout (Porter n.d.).
data("Gasoline", package = "plm")
Gas <- plm::pdata.frame(Gasoline, index = c("country", "year"), drop.index = TRUE)
xtsum(Gas)
| Variable | Dim | Mean | SD | Min | Max | Observations |
|---|---|---|---|---|---|---|
| ___________ | _________ | | | | | |
| lgaspcar | overall | 4.296 | 0.549 | 3.38 | 6.157 | N = 342 |
| | between | | 0.515 | 3.73 | 5.766 | n = 18 |
| | within | | 0.224 | 3.545 | 5.592 | T = 19 |
| ___________ | _________ | | | | | |
| lincomep | overall | -6.139 | 0.635 | -8.073 | -5.221 | N = 342 |
| | between | | 0.609 | -7.816 | -5.449 | n = 18 |
| | within | | 0.225 | -6.877 | -5.6 | T = 19 |
| ___________ | _________ | | | | | |
| lrpmg | overall | -0.523 | 0.678 | -2.896 | 1.125 | N = 342 |
| | between | | 0.684 | -2.709 | 0.739 | n = 18 |
| | within | | 0.127 | -1.057 | -0.137 | T = 19 |
| ___________ | _________ | | | | | |
| lcarpcap | overall | -9.042 | 1.219 | -13.475 | -7.536 | N = 342 |
| | between | | 1.114 | -12.459 | -7.781 | n = 18 |
| | within | | 0.557 | -11.332 | -7.691 | T = 19 |
data("Crime", package = "plm")
xtsum(Crime, variables = c("polpc", "avgsen", "crmrte"), id = "county", t = "year")
| Variable | Dim | Mean | SD | Min | Max | Observations |
|---|---|---|---|---|---|---|
| ___________ | _________ | | | | | |
| polpc | overall | 0.002 | 0.003 | 0 | 0.036 | N = 630 |
| | between | | 0.002 | 0.001 | 0.016 | n = 90 |
| | within | | 0.002 | -0.013 | 0.022 | T = 7 |
| ___________ | _________ | | | | | |
| avgsen | overall | 8.955 | 2.658 | 4.22 | 25.83 | N = 630 |
| | between | | 1.498 | 6.277 | 14.581 | n = 90 |
| | within | | 2.201 | 1.313 | 20.203 | T = 7 |
| ___________ | _________ | | | | | |
| crmrte | overall | 0.032 | 0.018 | 0.002 | 0.164 | N = 630 |
| | between | | 0.017 | 0.004 | 0.089 | n = 90 |
| | within | | 0.007 | -0.011 | 0.126 | T = 7 |
| Variable | Dim | Mean | SD | Min | Max | Observations |
|---|---|---|---|---|---|---|
| ___________ | _________ | | | | | |
| lincomep | overall | -6.139 | 0.635 | -8.073 | -5.221 | N = 342 |
| | between | | 0.609 | -7.816 | -5.449 | n = 18 |
| | within | | 0.225 | -6.877 | -5.6 | T = 19 |
| ___________ | _________ | | | | | |
| lgaspcar | overall | 4.296 | 0.549 | 3.38 | 6.157 | N = 342 |
| | between | | 0.515 | 3.73 | 5.766 | n = 18 |
| | within | | 0.224 | 3.545 | 5.592 | T = 19 |
Returning a data.frame can be useful if you wish to manipulate the results further or intend to use other reporting packages such as stargazer (Hlavac 2018) or kableExtra (Zhu 2021).
xtsum(Gas, variables = c("lincomep", "lgaspcar"), return.data.frame = TRUE)
#> # A tibble: 8 × 7
#> Variable Dim Mean SD Min Max Observations
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ___________ _________ <NA> <NA> <NA> <NA> <NA>
#> 2 lincomep overall -6.139 0.635 -8.073 -5.221 N = 342
#> 3 <NA> between <NA> 0.609 -7.816 -5.449 n = 18
#> 4 <NA> within <NA> 0.225 -6.877 -5.6 T = 19
#> 5 ___________ _________ <NA> <NA> <NA> <NA> <NA>
#> 6 lgaspcar overall 4.296 0.549 3.38 6.157 N = 342
#> 7 <NA> between <NA> 0.515 3.73 5.766 n = 18
#> 8 <NA> within <NA> 0.224 3.545 5.592 T = 19
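Note that every column in the returned object is of type character, so values need converting before any numeric work. The sketch below shows one possible clean-up in base R. It assumes a hand-copied stand-in for the output above (the object res, restricted to four columns for brevity) rather than a live call to xtsum.

```r
# Hand-copied stand-in for the data.frame printed above; all columns are character.
res <- data.frame(
  Variable = c("___________", "lincomep", NA, NA, "___________", "lgaspcar", NA, NA),
  Dim      = c("_________", "overall", "between", "within",
               "_________", "overall", "between", "within"),
  Mean     = c(NA, "-6.139", NA, NA, NA, "4.296", NA, NA),
  SD       = c(NA, "0.635", "0.609", "0.225", NA, "0.549", "0.515", "0.224"),
  stringsAsFactors = FALSE
)

# Drop the separator rows.
res <- res[res$Dim != "_________", ]

# Carry each variable name down to its between and within rows.
filled <- res$Variable
for (i in seq_along(filled)) {
  if (is.na(filled[i])) filled[i] <- filled[i - 1]
}
res$Variable <- filled

# Convert SD to numeric and arrange it as a variable-by-dimension matrix.
res$SD  <- as.numeric(res$SD)
sd_wide <- tapply(res$SD, list(res$Variable, res$Dim), identity)
sd_wide
```

The resulting matrix has one row per variable and one column per dimension, a shape that is usually easier to pass to a reporting package than the stacked layout.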
The functions below serve as helpers when the user is not interested in a full report but only wants to check a specific value.
This function computes the maximum between-group effect in panel data.
Usage
between_max(data, variable, id = NULL, t = NULL, na.rm = FALSE)
Arguments
data
A data.frame or pdata.frame object containing
the panel data.
variable
The variable for which the maximum
between-group effect is calculated.
id
(Optional) Name of the individual identifier
variable.
t
(Optional) Name of the time identifier
variable.
na.rm
Logical. Should missing values be removed?
Default is FALSE.
Value The maximum between-group effect.
This function computes the minimum between-group effect in panel data.
Usage
between_min(data, variable, id = NULL, t = NULL, na.rm = FALSE)
Arguments
data
A data.frame or pdata.frame object containing
the panel data.
variable
The variable for which the minimum
between-group effect is calculated.
id
(Optional) Name of the individual identifier
variable.
t
(Optional) Name of the time identifier
variable.
na.rm
Logical. Should missing values be removed?
Default is FALSE.
Value The minimum between-group effect.
This function calculates the standard deviation of between-group effects in panel data.
Usage
between_sd(data, variable, id = NULL, t = NULL, na.rm = FALSE)
Arguments
data
A data.frame or pdata.frame object containing
the panel data.
variable
The variable for which the standard
deviation of between-group effects is calculated.
id
(Optional) Name of the individual identifier
variable.
t
(Optional) Name of the time identifier
variable.
na.rm
Logical. Should missing values be removed?
Default is FALSE.
Value The standard deviation of between-group effects.
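To make the between-group quantities concrete, here is a minimal base R sketch of what the three between helpers summarize. It assumes that the between-group values are the per-individual means, as in the decomposition described at the top of this document. The toy data and the names id and x are made up, and this is not the package's own implementation.

```r
# Toy unbalanced panel with one missing value (made-up numbers).
panel <- data.frame(
  id = c("a", "a", "a", "b", "b", "c", "c", "c"),
  x  = c(4, 6, 8, 1, 3, 10, NA, 14)
)

# One mean per individual; na.rm = TRUE drops the missing observation
# for individual "c" before averaging.
ind_means <- tapply(panel$x, panel$id, mean, na.rm = TRUE)

max(ind_means)  # largest individual mean: 12
min(ind_means)  # smallest individual mean: 2
sd(ind_means)   # spread of the individual means
```

With na.rm = FALSE the mean for individual "c" would be NA, and so would all three summaries, which is why the na.rm argument matters for panels with gaps.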
This function computes the maximum within-group effect in panel data.
Usage
within_max(data, variable, id = NULL, t = NULL, na.rm = FALSE)
Arguments
data
A data.frame or pdata.frame object containing
the panel data.
variable
The variable for which the maximum
within-group effect is calculated.
id
(Optional) Name of the individual identifier
variable.
t
(Optional) Name of the time identifier
variable.
na.rm
Logical. Should missing values be removed?
Default is FALSE.
Value The maximum within-group effect.
This function computes the minimum within-group effect in panel data.
Usage
within_min(data, variable, id = NULL, t = NULL, na.rm = FALSE)
Arguments
data
A data.frame or pdata.frame object containing
the panel data.
variable
The variable for which the minimum
within-group effect is calculated.
id
(Optional) Name of the individual identifier
variable.
t
(Optional) Name of the time identifier
variable.
na.rm
Logical. Should missing values be removed?
Default is FALSE.
Value The minimum within-group effect.
This function computes the standard deviation of within-group effects in panel data.
Usage
within_sd(data, variable, id = NULL, t = NULL, na.rm = FALSE)
Arguments
data
A data.frame or pdata.frame object containing
the panel data.
variable
The variable for which the standard
deviation of within-group effects is calculated.
id
(Optional) Name of the individual identifier
variable.
t
(Optional) Name of the time identifier
variable.
na.rm
Logical. Should missing values be removed?
Default is FALSE.
Value The standard deviation of within-group effects.
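One property of the within-group quantities is worth illustrating: adding the global mean shifts the within minimum and maximum but leaves the within standard deviation unchanged. The base R sketch below assumes the within transformation defined at the top of this document. The toy data and the names id and x are made up, and the code is not taken from the package.

```r
# Toy unbalanced panel (made-up numbers).
panel <- data.frame(
  id = c("a", "a", "a", "b", "b", "c", "c"),
  x  = c(4, 6, 8, 1, 3, 10, 14)
)

global_mean <- mean(panel$x)
ind_mean    <- ave(panel$x, panel$id, FUN = mean)  # individual mean, repeated per row
dev         <- panel$x - ind_mean                  # raw deviations from own mean
within_x    <- dev + global_mean                   # within values with global mean added back

# Min and Max are shifted by the global mean ...
c(min(within_x), max(within_x)) - global_mean  # -2 and 2

# ... while the standard deviation is the same with or without the shift.
all.equal(sd(within_x), sd(dev))  # TRUE
```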