The sjmisc package complements dplyr: it takes over data transformation tasks on variables, such as recoding, dichotomizing or grouping variables, and setting or replacing missing values. The data transformation functions also support labelled data.
The design of the data transformation functions in this package follows, where appropriate, the tidyverse approach: the first argument of a function is always the data (either a data frame or a vector), followed by the names of the variables that the function should process. If no variables are specified, the function is applied to all of the data passed as the first argument.
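As a minimal sketch of the last point, the following uses a small made-up data frame `d` (not part of the package) and calls rec() once without and once with a variable name. The column names shown in the comments follow the default "_r" suffix that rec() applies.

```r
library(sjmisc)

# made-up example data, used only for this illustration
d <- data.frame(a = c(1, 2, 3, 4), b = c(4, 3, 2, 1))

# no variables named: every column of d is recoded
all_cols <- rec(d, rec = "1,2=0; 3,4=1", append = FALSE)
colnames(all_cols)
#> [1] "a_r" "b_r"

# one variable named: only that column is recoded
one_col <- rec(d, a, rec = "1,2=0; 3,4=1", append = FALSE)
colnames(one_col)
#> [1] "a_r"
```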
A major difference to dplyr functions like select() or filter() is that the data argument (the first argument of each function) may be either a data frame or a vector. The object returned by each function has the same type as the data argument:
library(sjmisc)
library(dplyr) # provides select(), mutate() and the select helpers used below
data(efc)
# returns a vector
x <- rec(efc$e42dep, rec = "1,2=1; 3,4=2")
str(x)
#> num [1:908] 2 2 2 2 2 2 2 2 2 2 ...
#> - attr(*, "label")= chr "elder's dependency"
# returns a data frame
rec(efc, e42dep, rec = "1,2=1; 3,4=2", append = FALSE) %>% head()
#> e42dep_r
#> 1 2
#> 2 2
#> 3 2
#> 4 2
#> 5 2
#> 6 2
This design choice is mainly for reasons of compatibility and convenience. It does not affect the usual tidyverse workflow or the use of pipe chains.
The selection of variables specified in the ... (ellipsis) argument is powered by dplyr’s select() and tidyselect’s select_helpers. This means you can use existing operators like : to select a range of variables, or use tidyselect’s select_helpers, like contains() or one_of().
# select all variables with "cop" in their names, and also
# the range from c161sex to c175empl
rec(
  efc, contains("cop"), c161sex:c175empl,
  rec = "0,1=0; else=1",
  append = FALSE
) %>% head()
#> c82cop1_r c83cop2_r c84cop3_r c85cop4_r c86cop5_r c87cop6_r c88cop7_r
#> 1 1 1 1 1 0 0 1
#> 2 1 1 1 1 1 0 1
#> 3 1 1 0 1 0 0 0
#> 4 1 0 1 0 0 0 0
#> 5 1 1 0 1 1 1 0
#> 6 1 1 1 1 1 1 1
#> c89cop8_r c90cop9_r c161sex_r c172code_r c175empl_r
#> 1 1 1 1 1 0
#> 2 1 1 1 1 0
#> 3 1 1 0 0 0
#> 4 1 1 0 1 0
#> 5 1 1 1 1 0
#> 6 0 0 0 1 0
# center all variables with "age" in name, variable c12hour
# and all variables from column 19 to 21
center(efc, c12hour, contains("age"), 19:21, append = FALSE) %>% head()
#> c12hour_c e17age_c c160age_c barthtot_c neg_c_7_c pos_v_4_c
#> 1 -26.39911 3.878788 2.5371809 10.453001 0.1502242 -0.476731
#> 2 105.60089 8.878788 0.5371809 10.453001 8.1502242 -1.476731
#> 3 27.60089 2.878788 26.5371809 -29.546999 -0.8497758 0.523269
#> 4 125.60089 -12.121212 15.5371809 -64.546999 -1.8497758 2.523269
#> 5 125.60089 4.878788 -6.4628191 -39.546999 0.1502242 2.523269
#> 6 -26.39911 5.878788 2.5371809 -4.546999 7.1502242 -3.476731
There are two types of function designs:
Functions like to_factor() or to_label(), which convert variables into other types or add additional information like variable or value labels as attributes, typically return the complete data frame that was given as first argument, without any new variables. The variables specified in the ... (ellipsis) argument are converted (overwritten); all other variables remain unchanged.
x <- efc[, 3:5]
x %>% str()
#> 'data.frame': 908 obs. of 3 variables:
#> $ e16sex: num 2 2 2 2 2 2 1 2 2 2 ...
#> ..- attr(*, "label")= chr "elder's gender"
#> ..- attr(*, "labels")= Named num [1:2] 1 2
#> .. ..- attr(*, "names")= chr [1:2] "male" "female"
#> $ e17age: num 83 88 82 67 84 85 74 87 79 83 ...
#> ..- attr(*, "label")= chr "elder' age"
#> $ e42dep: num 3 3 3 4 4 4 4 4 4 4 ...
#> ..- attr(*, "label")= chr "elder's dependency"
#> ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
#> .. ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
to_factor(x, e42dep, e16sex) %>% str()
#> 'data.frame': 908 obs. of 3 variables:
#> $ e16sex: Factor w/ 2 levels "1","2": 2 2 2 2 2 2 1 2 2 2 ...
#> ..- attr(*, "labels")= Named num [1:2] 1 2
#> .. ..- attr(*, "names")= chr [1:2] "male" "female"
#> ..- attr(*, "label")= chr "elder's gender"
#> $ e17age: num 83 88 82 67 84 85 74 87 79 83 ...
#> ..- attr(*, "label")= chr "elder' age"
#> $ e42dep: Factor w/ 4 levels "1","2","3","4": 3 3 3 4 4 4 4 4 4 4 ...
#> ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
#> .. ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
#> ..- attr(*, "label")= chr "elder's dependency"
Functions like rec() or dicho(), which transform or recode variables, by default add the transformed or recoded variables to the data frame, so they return the new variables and the original data as a combined data frame. To return only the transformed or recoded variables specified in the ... (ellipsis) argument, use the argument append = FALSE.
# complete data, including new columns
rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = TRUE) %>% head()
#> c12hour e15relat e16sex e17age e42dep c82cop1 c83cop2 c84cop3 c85cop4 c86cop5
#> 1 16 2 2 83 3 3 2 2 2 1
#> 2 148 2 2 88 3 3 3 3 3 4
#> 3 70 1 2 82 3 2 2 1 4 1
#> 4 168 1 2 67 4 4 1 3 1 1
#> 5 168 2 2 84 4 3 2 1 2 2
#> 6 16 2 2 85 4 2 2 3 3 3
#> c87cop6 c88cop7 c89cop8 c90cop9 c160age c161sex c172code c175empl barthtot
#> 1 1 2 3 3 56 2 2 1 75
#> 2 1 3 2 2 54 2 2 1 75
#> 3 1 1 4 3 80 1 1 0 35
#> 4 1 1 2 4 69 1 2 0 0
#> 5 2 1 4 4 47 2 2 0 25
#> 6 2 2 1 1 56 1 2 1 60
#> neg_c_7 pos_v_4 quol_5 resttotn tot_sc_e n4pstu nur_pst c82cop1_r c83cop2_r
#> 1 12 12 14 0 4 0 NA 2 0
#> 2 20 11 10 4 0 0 NA 2 2
#> 3 11 13 7 0 1 2 2 0 0
#> 4 10 15 12 2 0 3 3 2 0
#> 5 12 15 19 2 1 2 2 2 0
#> 6 19 9 8 1 3 2 2 0 0
# only new columns
rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = FALSE) %>% head()
#> c82cop1_r c83cop2_r
#> 1 2 0
#> 2 2 2
#> 3 0 0
#> 4 2 0
#> 5 2 0
#> 6 0 0
These variables usually get a suffix, so you can bind them as new columns to a data frame, for instance with add_columns(). The function add_columns() is useful if you want to bind or add columns to the end of a data frame within a pipe chain.
efc %>%
  rec(c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = FALSE) %>%
  add_columns(efc) %>%
  head()
#> c12hour e15relat e16sex e17age e42dep c82cop1 c83cop2 c84cop3 c85cop4 c86cop5
#> 1 16 2 2 83 3 3 2 2 2 1
#> 2 148 2 2 88 3 3 3 3 3 4
#> 3 70 1 2 82 3 2 2 1 4 1
#> 4 168 1 2 67 4 4 1 3 1 1
#> 5 168 2 2 84 4 3 2 1 2 2
#> 6 16 2 2 85 4 2 2 3 3 3
#> c87cop6 c88cop7 c89cop8 c90cop9 c160age c161sex c172code c175empl barthtot
#> 1 1 2 3 3 56 2 2 1 75
#> 2 1 3 2 2 54 2 2 1 75
#> 3 1 1 4 3 80 1 1 0 35
#> 4 1 1 2 4 69 1 2 0 0
#> 5 2 1 4 4 47 2 2 0 25
#> 6 2 2 1 1 56 1 2 1 60
#> neg_c_7 pos_v_4 quol_5 resttotn tot_sc_e n4pstu nur_pst c82cop1_r c83cop2_r
#> 1 12 12 14 0 4 0 NA 2 0
#> 2 20 11 10 4 0 0 NA 2 2
#> 3 11 13 7 0 1 2 2 0 0
#> 4 10 15 12 2 0 3 3 2 0
#> 5 12 15 19 2 1 2 2 2 0
#> 6 19 9 8 1 3 2 2 0 0
If append = TRUE and suffix = "", recoded variables will replace (overwrite) existing variables.
# complete data, existing columns c82cop1 and c83cop2 are replaced
rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = TRUE, suffix = "") %>% head()
#> c12hour e15relat e16sex e17age e42dep c82cop1 c83cop2 c84cop3 c85cop4 c86cop5
#> 1 16 2 2 83 3 2 0 2 2 1
#> 2 148 2 2 88 3 2 2 3 3 4
#> 3 70 1 2 82 3 0 0 1 4 1
#> 4 168 1 2 67 4 2 0 3 1 1
#> 5 168 2 2 84 4 2 0 1 2 2
#> 6 16 2 2 85 4 0 0 3 3 3
#> c87cop6 c88cop7 c89cop8 c90cop9 c160age c161sex c172code c175empl barthtot
#> 1 1 2 3 3 56 2 2 1 75
#> 2 1 3 2 2 54 2 2 1 75
#> 3 1 1 4 3 80 1 1 0 35
#> 4 1 1 2 4 69 1 2 0 0
#> 5 2 1 4 4 47 2 2 0 25
#> 6 2 2 1 1 56 1 2 1 60
#> neg_c_7 pos_v_4 quol_5 resttotn tot_sc_e n4pstu nur_pst
#> 1 12 12 14 0 4 0 NA
#> 2 20 11 10 4 0 0 NA
#> 3 11 13 7 0 1 2 2
#> 4 10 15 12 2 0 3 3
#> 5 12 15 19 2 1 2 2
#> 6 19 9 8 1 3 2 2
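The three behaviours described above (conversion in place, appended columns, and overwriting via an empty suffix) can be compared side by side on a small data frame. This is only a sketch; the data frame `d` is made up for the illustration and is not part of the package.

```r
library(sjmisc)

# made-up example data, used only for this illustration
d <- data.frame(a = c(1, 2, 3, 4), b = c(4, 3, 2, 1))

# conversion functions: same columns, the named variable is converted
converted <- to_factor(d, a)
colnames(converted)
#> [1] "a" "b"
is.factor(converted$a)
#> [1] TRUE

# recoding functions: the new variable is appended with the "_r" suffix
appended <- rec(d, a, rec = "1,2=0; 3,4=1", append = TRUE)
colnames(appended)
#> [1] "a"   "b"   "a_r"

# empty suffix: the recoded variable overwrites the original column
replaced <- rec(d, a, rec = "1,2=0; 3,4=1", append = TRUE, suffix = "")
colnames(replaced)
#> [1] "a" "b"
```

In the last call, column `a` holds the recoded values while `b` is left untouched.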
The functions of sjmisc are designed to work together seamlessly with other packages from the tidyverse, like dplyr. For instance, you can use the functions from sjmisc either within a pipe workflow to manipulate data frames, or to create new variables with mutate():
efc %>%
  select(c82cop1, c83cop2) %>%
  rec(rec = "1,2=0; 3:4=2") %>%
  head()
#> c82cop1 c83cop2 c82cop1_r c83cop2_r
#> 1 3 2 2 0
#> 2 3 3 2 2
#> 3 2 2 0 0
#> 4 4 1 2 0
#> 5 3 2 2 0
#> 6 2 2 0 0
efc %>%
  select(c82cop1, c83cop2) %>%
  mutate(
    c82cop1_dicho = rec(c82cop1, rec = "1,2=0; 3:4=2"),
    c83cop2_dicho = rec(c83cop2, rec = "1,2=0; 3:4=2")
  ) %>%
  head()
#> c82cop1 c83cop2 c82cop1_dicho c83cop2_dicho
#> 1 3 2 2 0
#> 2 3 3 2 2
#> 3 2 2 0 0
#> 4 4 1 2 0
#> 5 3 2 2 0
#> 6 2 2 0 0
This makes it easy to adapt the sjmisc functions to your own workflow.