The General Social Survey Cumulative Data (1972-2021) and Three Wave Panel Data files packaged for easy use in R.
gssr is a data package, bundling several datasets into a convenient format. The relatively large size of the data in the package means it is not suitable for hosting on CRAN, the core R package repository. There are two ways to install it.
You can install the beta version of gssr from GitHub with:
remotes::install_github("kjhealy/gssr")
drat
While using install_github() works just fine, it would be nicer to be able to just type install.packages("gssr") or update.packages(oldPkgs = "gssr") in the ordinary way. We can do this using Dirk Eddelbuettel’s drat package. Drat provides a convenient way to make R aware of package repositories other than CRAN.
First, install drat:
if (!require("drat")) {
  install.packages("drat")
  library("drat")
}
Then use drat to tell R about the repository where gssr is hosted:
drat::addRepo("kjhealy")
You can now install gssr:
install.packages("gssr")
To ensure that the gssr repository is always available, you can add the following line to your .Rprofile or Rprofile.site file:
drat::addRepo("kjhealy")
With that in place you’ll be able to do install.packages("gssr") or update.packages(oldPkgs = "gssr") and have everything work as you’d expect.
Note that the drat repository only contains data packages that are not on CRAN, so you will never be in danger of grabbing the wrong version of any other package.
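If the same .Rprofile is shared across machines where drat may not be installed, an unguarded call would raise an error at startup. A minimal sketch of a guarded version follows. It assumes only that drat::addRepo() behaves as described above, and the guard itself is an addition for illustration rather than part of the package's instructions:

```r
# Sketch for ~/.Rprofile: register the repository only if drat is installed.
# requireNamespace() checks for the package without attaching it, so R
# still starts cleanly on a machine where drat is missing.
if (requireNamespace("drat", quietly = TRUE)) {
  drat::addRepo("kjhealy")
}
```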
library(gssr)
#> Package loaded. To attach the GSS data, type data(gss_all) at the console.
#> For the codebook, type data(gss_doc).
#> For the panel data and documentation, type e.g. data(gss_panel08_long) and data(gss_panel_doc).
You can quickly get the data for any single GSS year by using gss_get_yr() to download the data file from NORC and put it directly into a tibble.
gss18 <- gss_get_yr(2018)
#> Fetching: https://gss.norc.org/documents/stata/2018_stata.zip
gss18
#> # A tibble: 2,348 × 1,065
#> abany abdefect abfelegl abhelp1 abhelp2 abhelp3 abhelp4
#> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl+l> <dbl+l>
#> 1 2 [no] 1 [yes] NA(i) [IAP] 1 [yes] 1 [yes] 1 [yes] 1 [yes]
#> 2 1 [yes] 1 [yes] 3 [it depends] 2 [no] 2 [no] 2 [no] 2 [no]
#> 3 NA(i) [IAP] NA(i) [IAP] NA(i) [IAP] 1 [yes] 2 [no] 1 [yes] 1 [yes]
#> 4 NA(i) [IAP] NA(i) [IAP] 1 [should] 1 [yes] 1 [yes] 1 [yes] 1 [yes]
#> 5 2 [no] 1 [yes] NA(i) [IAP] 2 [no] 2 [no] 2 [no] 1 [yes]
#> 6 1 [yes] 1 [yes] 1 [should] 1 [yes] 1 [yes] 1 [yes] 1 [yes]
#> 7 1 [yes] 1 [yes] 3 [it depends] 1 [yes] 2 [no] 1 [yes] 1 [yes]
#> 8 2 [no] 1 [yes] NA(i) [IAP] 1 [yes] 2 [no] 1 [yes] 1 [yes]
#> 9 NA(i) [IAP] NA(i) [IAP] 3 [it depends] 1 [yes] 1 [yes] 1 [yes] 1 [yes]
#> 10 NA(i) [IAP] NA(i) [IAP] NA(i) [IAP] 1 [yes] 2 [no] 2 [no] 1 [yes]
#> # … with 2,338 more rows, and 1,058 more variables: abhlth <dbl+lbl>,
#> # abinspay <dbl+lbl>, abmedgov1 <dbl+lbl>, abmedgov2 <dbl+lbl>,
#> # abmelegl <dbl+lbl>, abmoral <dbl+lbl>, abnomore <dbl+lbl>,
#> # abpoor <dbl+lbl>, abpoorw <dbl+lbl>, abrape <dbl+lbl>, absingle <dbl+lbl>,
#> # abstate1 <dbl+lbl>, abstate2 <dbl+lbl>, acqntsex <dbl+lbl>,
#> # actssoc <dbl+lbl>, adminconsent <dbl+lbl>, adults <dbl+lbl>,
#> # advfront <dbl+lbl>, affrmact <dbl+lbl>, afraidof <dbl+lbl>, …
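The <dbl+lbl> columns in this printout are labelled vectors of the kind created by the haven package, and entries such as NA(i) are tagged missing values. The following is a hedged sketch of how such a column might be converted for analysis. It uses a small made-up vector rather than the real data (toy_abany is hypothetical) and assumes haven is installed:

```r
library(haven)

# A made-up labelled vector imitating the structure of a GSS column.
# The values and labels here are illustrative, not taken from the data.
toy_abany <- labelled(
  c(2, 1, tagged_na("i"), 1),
  labels = c(yes = 1, no = 2, IAP = tagged_na("i"))
)

# Convert value labels to factor levels, e.g. for tabulation or plotting.
abany_fct <- as_factor(toy_abany)

# Or drop the labels and keep the underlying numeric codes.
abany_num <- zap_labels(toy_abany)

# The tag attached to a missing value can be recovered with na_tag().
na_tag(abany_num)
```

Which conversion is appropriate depends on the analysis: factors suit categorical summaries, while the numeric codes are needed for variables that are genuinely quantitative.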
The GSS cumulative data file is large. It is not loaded by default when you invoke the package. (That is, gssr does not use R’s “lazy loading” facility. The data file is too big to do this without error.) To load one of the datasets, first load the library and then use data() to make the data available. For example, load the cumulative GSS file like this:
data(gss_all)
This will take a moment. Once it is ready, the gss_all object is available to use in the usual way:
gss_all
#> # A tibble: 68,846 × 6,311
#> year id wrkstat hrs1 hrs2 evwork occ prestige wrkslf wrkgovt
#> <dbl> <dbl> <dbl+lbl> <dbl> <dbl> <dbl+lbl> <dbl> <dbl> <dbl+l> <dbl+l>
#> 1 1972 1 1 [workin… NA(i) NA(i) NA(i) 205 50 2 [som… NA(i)
#> 2 1972 2 5 [retire… NA(i) NA(i) 1 [yes] 441 45 2 [som… NA(i)
#> 3 1972 3 2 [workin… NA(i) NA(i) NA(i) 270 44 2 [som… NA(i)
#> 4 1972 4 1 [workin… NA(i) NA(i) NA(i) 1 57 2 [som… NA(i)
#> 5 1972 5 7 [keepin… NA(i) NA(i) 1 [yes] 385 40 2 [som… NA(i)
#> 6 1972 6 1 [workin… NA(i) NA(i) NA(i) 281 49 2 [som… NA(i)
#> 7 1972 7 1 [workin… NA(i) NA(i) NA(i) 522 41 2 [som… NA(i)
#> 8 1972 8 1 [workin… NA(i) NA(i) NA(i) 314 36 2 [som… NA(i)
#> 9 1972 9 2 [workin… NA(i) NA(i) NA(i) 912 26 2 [som… NA(i)
#> 10 1972 10 1 [workin… NA(i) NA(i) NA(i) 984 18 2 [som… NA(i)
#> # … with 68,836 more rows, and 6,301 more variables: commute <dbl>,
#> # industry <dbl>, occ80 <dbl>, prestg80 <dbl>, indus80 <dbl+lbl>,
#> # indus07 <dbl>, occonet <dbl>, found <dbl>, occ10 <dbl+lbl>, occindv <dbl>,
#> # occstatus <dbl>, occtag <dbl>, prestg10 <dbl>, prestg105plus <dbl>,
#> # indus10 <dbl+lbl>, indstatus <dbl>, indtag <dbl>, marital <dbl+lbl>,
#> # martype <dbl+lbl>, agewed <dbl>, divorce <dbl+lbl>, widowed <dbl+lbl>,
#> # spwrksta <dbl+lbl>, sphrs1 <dbl+lbl>, sphrs2 <dbl+lbl>, …
The variables are documented in two supplementary tibbles, gss_doc and gss_dict. To load gss_doc, do this:
data(gss_doc)
gss_doc
#> # A tibble: 6,144 × 5
#> id description properties marginals text
#> <chr> <chr> <list> <list> <chr>
#> 1 caseid YEAR + Respondent ID <tibble [2 × 3]> <tibble [1 × 3]> None
#> 2 year GSS year for this respondent <tibble [2 × 3]> <tibble> None
#> 3 id Respondent ID number <tibble [2 × 3]> <tibble [1 × 3]> None
#> 4 age Age of respondent <tibble [3 × 3]> <tibble [1 × 3]> 13. …
#> 5 sex Respondents sex <tibble [3 × 3]> <tibble [3 × 5]> 23. …
#> 6 race Race of respondent <tibble [3 × 3]> <tibble [4 × 5]> 24. …
#> 7 racecen1 What Is R's race 1st mention <tibble [3 × 3]> <tibble> 1602…
#> 8 racecen2 What Is R's race 2nd mention <tibble [3 × 3]> <tibble> 1602…
#> 9 racecen3 What Is R's race 3rd mention <tibble [3 × 3]> <tibble> 1602…
#> 10 hispanic Hispanic specified <tibble [3 × 3]> <tibble> 1601…
#> # … with 6,134 more rows
You can take a look at information on a particular variable by doing something like this:
library(dplyr) # provides %>%, filter(), and select()
gss_doc %>% filter(id == "race") %>%
  select(id, description, text)
#> # A tibble: 1 × 3
#> id description text
#> <chr> <chr> <chr>
#> 1 race Race of respondent 24. What race do you consider yourself?
To look at a variable’s marginals or its properties, use unnest() from the tidyr package:
library(tidyr)
gss_doc %>% filter(id == "race") %>%
  select(marginals) %>%
  unnest(cols = c(marginals))
#> # A tibble: 4 × 5
#> percent n value label id
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 80.3 52,033 1 WHITE RACE
#> 2 14.2 9,187 2 BLACK RACE
#> 3 5.5 3,594 3 OTHER RACE
#> 4 100 64,814 <NA> Total RACE
gss_doc %>% filter(id == "race") %>%
  select(properties) %>%
  unnest(cols = c(properties))
#> # A tibble: 3 × 3
#> property value id
#> <chr> <chr> <chr>
#> 1 Data type numeric RACE
#> 2 Missing-data code 0 RACE
#> 3 Record/column 1/298 RACE
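The marginals and properties columns are list-columns: each cell holds a whole tibble. To illustrate what unnest() does with such a column, here is a minimal sketch on a made-up two-row codebook. The object toy_doc and its contents are hypothetical stand-ins for gss_doc, so the example runs without the package:

```r
library(dplyr)
library(tidyr)

# A made-up codebook with a list-column, imitating the shape of gss_doc.
toy_doc <- tibble(
  id = c("race", "sex"),
  marginals = list(
    tibble(value = c("1", "2", "3"), label = c("WHITE", "BLACK", "OTHER")),
    tibble(value = c("1", "2"), label = c("MALE", "FEMALE"))
  )
)

# Keep the row for one variable, then expand the nested tibble in its
# `marginals` cell into ordinary rows and columns.
race_marginals <- toy_doc %>%
  filter(id == "race") %>%
  select(marginals) %>%
  unnest(cols = c(marginals))

race_marginals
```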
There are convenience functions to do this as well, for one or more categorical variables. One for the marginals:
gss_get_marginals(varnames = c("race", "sex"))
#> # A tibble: 7 × 6
#> variable percent n value label id
#> <chr> <dbl> <int> <chr> <chr> <chr>
#> 1 sex 44.1 28614 1 MALE SEX
#> 2 sex 55.9 36200 2 FEMALE SEX
#> 3 sex 100 64814 <NA> Total SEX
#> 4 race 80.3 52033 1 WHITE RACE
#> 5 race 14.2 9187 2 BLACK RACE
#> 6 race 5.5 3594 3 OTHER RACE
#> 7 race 100 64814 <NA> Total RACE
And one for the properties:
gss_get_props(varnames = c("race", "sex"))
#> # A tibble: 6 × 4
#> variable property value id
#> <chr> <chr> <chr> <chr>
#> 1 sex Data type numeric SEX
#> 2 sex Missing-data code 0 SEX
#> 3 sex Record/column 1/297 SEX
#> 4 race Data type numeric RACE
#> 5 race Missing-data code 0 RACE
#> 6 race Record/column 1/298 RACE
The package also comes with gss_dict, a tibble with similar information in a slightly different format:
data(gss_dict)
gss_dict
#> # A tibble: 2,469 × 6
#> pos variable label col_type value_labels years
#> <int> <chr> <chr> <chr> <chr> <list>
#> 1 1 wrkstat labor force status dbl+lbl [1] working… <tibble>
#> 2 2 hrs1 number of hours worked last we… dbl+lbl [89] 80+ ho… <tibble>
#> 3 3 hrs2 number of hours usually work a… dbl+lbl [89] 80+ ho… <tibble>
#> 4 4 evwork ever work as long as one year dbl+lbl [1] yes; [2… <tibble>
#> 5 5 wrkslf r self-emp or works for somebo… dbl+lbl [1] self-em… <tibble>
#> 6 6 wrkgovt govt or private employee dbl+lbl [1] governm… <tibble>
#> 7 7 indus80 r's industry code (1980) dbl+lbl [1] strongl… <tibble>
#> 8 8 occ10 r's census occupation code (20… dbl+lbl [10] chief … <tibble>
#> 9 9 indus10 r's industry code (naics 2007) dbl+lbl [170] crop … <tibble>
#> 10 10 marital marital status dbl+lbl [1] married… <tibble>
#> # … with 2,459 more rows
We often want to know which years a question or group of questions was asked. We can find this out for one or more variables with gss_which_years().
gss_which_years(gss_all, fefam)
#> # A tibble: 33 x 2
#> year fefam
#> <dbl> <lgl>
#> 1 1972 FALSE
#> 2 1973 FALSE
#> 3 1974 FALSE
#> 4 1975 FALSE
#> 5 1976 FALSE
#> 6 1977 TRUE
#> 7 1978 FALSE
#> 8 1980 FALSE
#> 9 1982 FALSE
#> 10 1983 FALSE
#> # … with 22 more rows
When querying more than one variable, use c():
gss_all %>%
  gss_which_years(c(industry, indus80, wrkgovt, commute)) %>%
  print(n = Inf)
#> # A tibble: 33 x 5
#> year industry indus80 wrkgovt commute
#> <dbl> <lgl> <lgl> <lgl> <lgl>
#> 1 1972 TRUE FALSE FALSE FALSE
#> 2 1973 TRUE FALSE FALSE FALSE
#> 3 1974 TRUE FALSE FALSE FALSE
#> 4 1975 TRUE FALSE FALSE FALSE
#> 5 1976 TRUE FALSE FALSE FALSE
#> 6 1977 TRUE FALSE FALSE FALSE
#> 7 1978 TRUE FALSE FALSE FALSE
#> 8 1980 TRUE FALSE FALSE FALSE
#> 9 1982 TRUE FALSE FALSE FALSE
#> 10 1983 TRUE FALSE FALSE FALSE
#> 11 1984 TRUE FALSE FALSE FALSE
#> 12 1985 TRUE FALSE TRUE FALSE
#> 13 1986 TRUE FALSE TRUE TRUE
#> 14 1987 TRUE FALSE FALSE FALSE
#> 15 1988 TRUE TRUE FALSE FALSE
#> 16 1989 TRUE TRUE FALSE FALSE
#> 17 1990 TRUE TRUE FALSE FALSE
#> 18 1991 FALSE TRUE FALSE FALSE
#> 19 1993 FALSE TRUE FALSE FALSE
#> 20 1994 FALSE TRUE FALSE FALSE
#> 21 1996 FALSE TRUE FALSE FALSE
#> 22 1998 FALSE TRUE FALSE FALSE
#> 23 2000 FALSE TRUE TRUE FALSE
#> 24 2002 FALSE TRUE TRUE FALSE
#> 25 2004 FALSE TRUE TRUE FALSE
#> 26 2006 FALSE TRUE TRUE FALSE
#> 27 2008 FALSE TRUE TRUE FALSE
#> 28 2010 FALSE TRUE TRUE FALSE
#> 29 2012 FALSE FALSE TRUE FALSE
#> 30 2014 FALSE FALSE TRUE FALSE
#> 31 2016 FALSE FALSE TRUE FALSE
#> 32 2018 FALSE FALSE TRUE FALSE
#> 33 2021 FALSE FALSE TRUE FALSE
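One way to read the tables above is that a variable counts as asked in a year when at least one respondent has a non-missing value for it. The sketch below applies that idea to a made-up table with dplyr. It is an illustration under that assumption, not the package's own implementation of gss_which_years(), and toy_gss is hypothetical:

```r
library(dplyr)

# A made-up extract: fefam has responses only in 1977.
toy_gss <- tibble(
  year  = c(1976, 1976, 1977, 1977, 1978),
  fefam = c(NA, NA, 1, 2, NA)
)

# For each year, TRUE if any respondent has a non-missing value.
asked <- toy_gss %>%
  group_by(year) %>%
  summarize(fefam = any(!is.na(fefam)))

asked
```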
The GSS administrators have released a Methodological Primer along with the Documentation and Codebook for the 2021 survey that users should read carefully in connection with the effects of COVID-19 on data collection for the GSS.
The Primer notes:
Since its inception, the GSS has conducted data collection via in-person interviews as its primary mode of data collection. The pandemic forced the GSS to change this design, moving from in-person to address-based sampling and a push-to-web methodology, with the bulk of the interview conducted online via a self-administered questionnaire.
In addition,
We recommend our users include the one of the following statements when reporting on the GSS 2021 Cross-section data: Total Survey Error Summary Perspective for the 2021 GSS Cross-section: Changes in opinions, attitudes, and behaviors observed in 2021 relative to historical trends may be due to actual change in concept over time and/or may have resulted from methodological changes made to the survey methodology during the COVID-19 global pandemic.
And,
Suggested Statement to Include in Articles and Reports That Use GSS Data: To safeguard the health of staff and respondents during the COVID-19 pandemic, the 2021 GSS data collection used a mail-to-web methodology instead of its traditional in-person interviews. Research and interpretation done using the data should take extra care to ensure the analysis reflects actual changes in public opinion and is not unduly influenced by the change in data collection methods. For more information on the 2021 GSS methodology and its implications, please visit https://gss.norc.org/Get-The-Data
In addition to the Cumulative Data File, the gssr package also includes the GSS’s panel data. The current rotating panel design began in 2006. A panel of respondents was interviewed that year and followed up for further interviews in 2008 and 2010. A second panel was interviewed beginning in 2008, with follow-up interviews in 2010 and 2012. A third panel began in 2010, with follow-up interviews in 2012 and 2014. The gssr package provides three datasets, one for each of the three-wave panels. They are gss_panel06_long, gss_panel08_long, and gss_panel10_long. The datasets are provided by the GSS in wide format but (as their names suggest) they are packaged here in long format. The conversion was carried out using the panelr package and its long_panel() function. Conversion from long back to wide format is possible with the tools provided in panelr.
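For readers who prefer not to use panelr, a long-to-wide reshape can also be sketched with tidyr::pivot_wider(). This is a swapped-in technique shown on a made-up table, not the method used to build the package. The object toy_long is hypothetical, and its columns (a respondent identifier, a wave index, and two measures) only imitate the layout of the panel files:

```r
library(tidyr)

# A made-up long-format panel: one row per respondent per wave.
toy_long <- tibble(
  firstid = factor(c(1, 1, 1, 2, 2, 2)),
  wave    = rep(1:3, times = 2),
  id      = c(1, 8001, NA, 2, 8002, 8001),
  sex     = c(1, 1, NA, 1, 1, 1)
)

# Spread each measure across waves, giving one row per respondent and
# columns suffixed with the wave number (id_1, id_2, ..., sex_3).
toy_wide <- pivot_wider(
  toy_long,
  id_cols     = firstid,
  names_from  = wave,
  values_from = c(id, sex),
  names_sep   = "_"
)

toy_wide
```

With well over a thousand measures in the real files, the wide result would have several thousand columns, which is one reason the long layout is more convenient to work with.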
The panel data objects must be loaded in the same way as the cumulative data file.
data("gss_panel06_long")
gss_panel06_long
#> # A tibble: 6,000 × 1,572
#> firstid wave ballot form formwt oversamp sampcode sample samptype
#> <fct> <dbl> <dbl+lbl> <dbl+l> <dbl> <dbl> <dbl+lb> <dbl+l> <dbl+lbl>
#> 1 9 1 3 [BALLOT … 2 [ALT… 1 1 501 9 [200… 2006 [200…
#> 2 9 2 3 [BALLOT … 2 [ALT… 1 1 501 9 [200… 2006 [200…
#> 3 9 3 3 [BALLOT … 2 [ALT… 1 1 501 9 [200… 2006 [200…
#> 4 10 1 1 [BALLOT … 1 [STA… 1 1 501 9 [200… 2006 [200…
#> 5 10 2 1 [BALLOT … 1 [STA… 1 1 501 9 [200… 2006 [200…
#> 6 10 3 1 [BALLOT … 1 [STA… 1 1 501 9 [200… 2006 [200…
#> 7 11 1 3 [BALLOT … 2 [ALT… 1 1 501 9 [200… 2006 [200…
#> 8 11 2 3 [BALLOT … 2 [ALT… 1 1 501 9 [200… 2006 [200…
#> 9 11 3 3 [BALLOT … 2 [ALT… 1 1 501 9 [200… 2006 [200…
#> 10 12 1 1 [BALLOT … 2 [ALT… 1 1 501 9 [200… 2006 [200…
#> # … with 5,990 more rows, and 1,563 more variables: vstrat <dbl+lbl>,
#> # vpsu <dbl+lbl>, wtpan12 <dbl+lbl>, wtpan123 <dbl+lbl>, wtpannr12 <dbl+lbl>,
#> # wtpannr123 <dbl+lbl>, letin1a <dbl+lbl>, abany <dbl+lbl>,
#> # abdefect <dbl+lbl>, abhlth <dbl+lbl>, abnomore <dbl+lbl>, abpoor <dbl+lbl>,
#> # abrape <dbl+lbl>, absingle <dbl+lbl>, accntsci <dbl+lbl>,
#> # acqasian <dbl+lbl>, acqattnd <dbl+lbl>, acqblack <dbl+lbl>,
#> # acqbrnda <dbl+lbl>, acqchild <dbl+lbl>, acqcohab <dbl+lbl>, …
Although the panel data objects were created by panelr, they are regular tibbles. You do not need to use panelr to work with the data.
The column names in long format do not have wave identifiers. Rather, the firstid and wave variables track the cases. The firstid variable is unique for every respondent in the panel and has no missing values. The wave variable indexes responses from a given firstid panelist in each wave (if observed). The id variable is from the GSS and indexes individuals within waves.
data("gss_panel08_long")
gss_panel08_long %>%
  select(firstid, wave, id, sex)
#> # A tibble: 6,069 × 4
#> firstid wave id sex
#> <fct> <dbl> <dbl+lbl> <dbl+lbl>
#> 1 1 1 1 1 [MALE]
#> 2 1 2 8001 1 [MALE]
#> 3 1 3 NA NA
#> 4 2 1 2 1 [MALE]
#> 5 2 2 8002 1 [MALE]
#> 6 2 3 8001 1 [MALE]
#> 7 3 1 3 1 [MALE]
#> 8 3 2 8003 1 [MALE]
#> 9 3 3 8002 1 [MALE]
#> 10 4 1 4 1 [MALE]
#> # … with 6,059 more rows
We can look at attrition across waves with, e.g.:
gss_panel06_long %>%
  select(wave, id) %>%
  group_by(wave) %>%
  # na.rm = TRUE stops n_distinct() from counting NA as one more id
  summarize(observed = n_distinct(id, na.rm = TRUE),
            missing = sum(is.na(id)))
#> # A tibble: 3 × 3
#> wave observed missing
#> <dbl> <int> <int>
#> 1 1 2000 0
#> 2 2 1536 464
#> 3 3 1276 724
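Attrition can also be viewed per respondent rather than per wave. The sketch below counts how many waves each panelist was observed in, on a made-up table. It assumes, as in the summary above, that a missing id marks a wave in which the respondent was not interviewed. The object toy_panel is hypothetical:

```r
library(dplyr)

# A made-up three-wave panel with three respondents.
toy_panel <- tibble(
  firstid = factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3)),
  wave    = rep(1:3, times = 3),
  id      = c(1, 8001, NA, 2, 8002, 8001, 3, NA, NA)
)

# Number of waves with a non-missing id for each respondent.
waves_observed <- toy_panel %>%
  group_by(firstid) %>%
  summarize(n_waves = sum(!is.na(id)))

# Distribution of respondents by number of waves completed.
count(waves_observed, n_waves)
```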
The documentation tibble for the panel data is called gss_panel_doc.
data("gss_panel_doc")
gss_panel_doc
#> # A tibble: 628 × 9
#> id description text properties_1 properties_2 properties_3 marginals_1
#> <chr> <chr> <chr> <list> <list> <list> <list>
#> 1 caseid CASEID None <tibble> <NULL> <NULL> <tibble>
#> 2 year YEAR None <tibble> <tibble> <tibble> <tibble>
#> 3 id ID None <tibble> <tibble> <tibble> <tibble>
#> 4 age AGE 13. … <tibble> <tibble> <tibble> <tibble>
#> 5 sex SEX 23. … <tibble> <tibble> <tibble> <tibble>
#> 6 race RACE 24. … <tibble> <tibble> <tibble> <tibble>
#> 7 racecen1 RACECEN1 1602… <tibble> <tibble> <tibble> <tibble>
#> 8 racecen2 RACECEN2 1602… <tibble> <tibble> <tibble> <tibble>
#> 9 racecen3 RACECEN3 1602… <tibble> <tibble> <tibble> <tibble>
#> 10 hispanic HISPANIC 1601… <tibble> <tibble> <tibble> <tibble>
#> # … with 618 more rows, and 2 more variables: marginals_2 <list>,
#> # marginals_3 <list>
Each row is a variable. The id, description, and text columns provide the details on each question or measure. The properties and marginals are provided in the remaining columns, with a suffix indicating the wave. The categorical variables in the panel codebook can be queried in the same way as those in the cumulative codebook. We specify that we want to look at gss_panel_doc rather than gss_doc and we say which property wave or marginals wave we want to see.
gss_get_marginals(varnames = c("sex", "race"), data = gss_panel_doc, margin = marginals_2)
#> # A tibble: 9 × 6
#> variable percent n value label id
#> <chr> <dbl> <int> <chr> <chr> <chr>
#> 1 sex 41.7 640 "1" MALE SEX_2
#> 2 sex 58.3 896 "2" FEMALE SEX_2
#> 3 sex NA 464 "." (Does not apply) SEX_2
#> 4 sex 100 2000 "" Total SEX_2
#> 5 race 78.6 1208 "1" WHITE RACE_2
#> 6 race 13.7 210 "2" BLACK RACE_2
#> 7 race 7.7 118 "3" OTHER RACE_2
#> 8 race NA 464 "0" IAP RACE_2
#> 9 race 100 2000 <NA> Total RACE_2
gss_get_marginals(varnames = "padeg", data = gss_panel_doc, margin = marginals_1)
#> # A tibble: 9 × 6
#> variable percent n value label id
#> <chr> <dbl> <int> <chr> <chr> <chr>
#> 1 padeg 38.3 602 0 LT HIGH SCHOOL PADEG_1
#> 2 padeg 40.4 635 1 HIGH SCHOOL PADEG_1
#> 3 padeg 2.7 43 2 JUNIOR COLLEGE PADEG_1
#> 4 padeg 10.6 167 3 BACHELOR PADEG_1
#> 5 padeg 7.9 124 4 GRADUATE PADEG_1
#> 6 padeg NA 355 7 IAP PADEG_1
#> 7 padeg NA 60 8 DK PADEG_1
#> 8 padeg NA 14 9 <NA> PADEG_1
#> 9 padeg 100 2000 <NA> Total PADEG_1
The package is documented at http://kjhealy.github.io/gssr/. The GSS homepage is at http://gss.norc.org/. While the gssr package incorporates the publicly-available GSS cumulative data file, the package is not associated with or endorsed by the National Opinion Research Center or the General Social Survey.