Appendix C — Getting Data Into R

C.1 Using data from an R package

To use data from an R package, such as the bechdel data from the fivethirtyeight package, load the package with library(). The data frame is then available by name, and typing bechdel prints the tibble shown here.
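A minimal sketch of the commands involved, assuming the fivethirtyeight package has already been installed and that the tibble package is loaded so the result prints in tibble form (both are assumptions about your setup, not something this appendix states):

```r
# Assumes a one-time install.packages("fivethirtyeight") has been done
library(tibble)           # assumed here only so the data prints as a tibble
library(fivethirtyeight)  # makes the bechdel data frame available

bechdel                   # typing the name prints the data
```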

# A tibble: 1,794 × 15
    year imdb      title  test  clean_test binary budget domgross intgross code 
   <int> <chr>     <chr>  <chr> <ord>      <chr>   <int>    <dbl>    <dbl> <chr>
 1  2013 tt1711425 21 & … nota… notalk     FAIL   1.3 e7 25682380   4.22e7 2013…
 2  2012 tt1343727 Dredd… ok-d… ok         PASS   4.50e7 13414714   4.09e7 2012…
 3  2013 tt2024544 12 Ye… nota… notalk     FAIL   2   e7 53107035   1.59e8 2013…
 4  2013 tt1272878 2 Guns nota… notalk     FAIL   6.1 e7 75612460   1.32e8 2013…
 5  2013 tt0453562 42     men   men        FAIL   4   e7 95020213   9.50e7 2013…
 6  2013 tt1335975 47 Ro… men   men        FAIL   2.25e8 38362475   1.46e8 2013…
 7  2013 tt1606378 A Goo… nota… notalk     FAIL   9.2 e7 67349198   3.04e8 2013…
 8  2013 tt2194499 About… ok-d… ok         PASS   1.20e7 15323921   8.73e7 2013…
 9  2013 tt1814621 Admis… ok    ok         PASS   1.3 e7 18007317   1.80e7 2013…
10  2013 tt1815862 After… nota… notalk     FAIL   1.3 e8 60522097   2.44e8 2013…
# ℹ 1,784 more rows
# ℹ 5 more variables: budget_2013 <int>, domgross_2013 <dbl>,
#   intgross_2013 <dbl>, period_code <int>, decade_code <int>

For more on this example, visit Bechdel analysis using the tidyverse.

C.2 Using read_rds to read in an R data set

We have provided the nnyfs.Rds data file on the course data page.

Suppose you have downloaded this data file into a directory called data, which is a subdirectory of the directory where you plan to do your work, perhaps called 431-nnyfs.

Open RStudio and create a new project in the 431-nnyfs directory on your computer. After the project is created, you should see the data subdirectory in the Files window in RStudio.

Now, read the nnyfs.Rds file into a new tibble in R called nnyfs with the following command. The read_rds function comes from the readr package, which is loaded as part of the tidyverse.

nnyfs <- read_rds("data/nnyfs.Rds")

Here are the results…

nnyfs
# A tibble: 1,518 × 45
   SEQN  sex    age_child race_eth    educ_child language sampling_wt income_pov
   <chr> <fct>      <dbl> <fct>            <dbl> <fct>          <dbl>      <dbl>
 1 71917 Female        15 3_Black No…          9 English       28299.       0.21
 2 71918 Female         8 3_Black No…          2 English       15127.       5   
 3 71919 Female        14 2_White No…          8 English       29977.       5   
 4 71920 Female        15 2_White No…          8 English       80652.       0.87
 5 71921 Male           3 2_White No…         NA English       55592.       4.34
 6 71922 Male          12 1_Hispanic           6 English       27365.       5   
 7 71923 Male          12 2_White No…          5 English       86673.       5   
 8 71924 Female         8 4_Other Ra…          2 English       39549.       2.74
 9 71925 Male           7 1_Hispanic           0 English       42333.       0.46
10 71926 Male           8 3_Black No…          2 English       15307.       1.57
# ℹ 1,508 more rows
# ℹ 37 more variables: age_adult <dbl>, educ_adult <fct>, respondent <fct>,
#   salt_used <fct>, energy <dbl>, protein <dbl>, sugar <dbl>, fat <dbl>,
#   diet_yesterday <fct>, water <dbl>, plank_time <dbl>, height <dbl>,
#   weight <dbl>, bmi <dbl>, bmi_cat <fct>, arm_length <dbl>, waist <dbl>,
#   arm_circ <dbl>, calf_circ <dbl>, calf_skinfold <dbl>,
#   triceps_skinfold <dbl>, subscapular_skinfold <dbl>, active_days <dbl>, …
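One practical consequence of the .Rds format is that it stores the R object itself, so column types such as factors survive the trip to disk and back. The sketch below illustrates this with a small made-up tibble written to a temporary file. The toy data and file path are hypothetical (not the course data), and the code assumes the readr and tibble packages are installed.

```r
library(readr)   # provides write_rds() and read_rds()
library(tibble)

# A tiny hypothetical tibble with one factor column
toy <- tibble(
  id  = c("a", "b", "c"),
  sex = factor(c("Female", "Male", "Female"))
)

# Write to a temporary .Rds file, then read it back
rds_path <- tempfile(fileext = ".Rds")
write_rds(toy, rds_path)
toy_back <- read_rds(rds_path)

# The factor column and its levels come back unchanged
is.factor(toy_back$sex)
levels(toy_back$sex)
```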

C.3 Using read_csv to read in a comma-separated version of a data file

We have provided the fev_ros.csv data file on the course data page.

Suppose you have downloaded this data file into a directory on your computer called data.

Now, read the fev_ros.csv file into a new tibble in R called fev_ros with the following command. The mutate step converts the character variables to factors, which you will often want to do before analyzing the data.

fev_ros <- read_csv("data/fev_ros.csv") |>
  mutate(across(where(is.character), as_factor))
Rows: 654 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): sex, smoke
dbl (4): id, age, fev, height

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fev_ros
# A tibble: 654 × 6
      id   age   fev height sex    smoke             
   <dbl> <dbl> <dbl>  <dbl> <fct>  <fct>             
 1   301     9  1.71   57   female non-current smoker
 2   451     8  1.72   67.5 female non-current smoker
 3   501     7  1.72   54.5 female non-current smoker
 4   642     9  1.56   53   male   non-current smoker
 5   901     9  1.90   57   male   non-current smoker
 6  1701     8  2.34   61   female non-current smoker
 7  1752     6  1.92   58   female non-current smoker
 8  1753     6  1.42   56   female non-current smoker
 9  1901     8  1.99   58.5 female non-current smoker
10  1951     9  1.94   60   female non-current smoker
# ℹ 644 more rows

Note that sex and smoke are now listed as factor (<fct>) variables, even though the column specification reported them as character (chr) when the file was read.
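To see what the conversion step does in isolation, here is a small self-contained sketch using a hypothetical three-row CSV written to a temporary file (not the course file). It assumes the readr, dplyr, and forcats packages, all part of the tidyverse, are installed. Setting show_col_types = FALSE quiets the column specification message.

```r
library(readr)    # read_csv(), write_csv()
library(dplyr)    # mutate(), across()
library(forcats)  # as_factor()

# Write a tiny hypothetical CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(id  = 1:3,
             sex = c("female", "male", "female"),
             fev = c(1.71, 1.56, 1.72)),
  csv_path
)

# Without the conversion, sex is imported as a character column
raw <- read_csv(csv_path, show_col_types = FALSE)
class(raw$sex)

# With the conversion, every character column becomes a factor
converted <- raw |>
  mutate(across(where(is.character), as_factor))
class(converted$sex)
levels(converted$sex)   # as_factor() orders levels by first appearance
```

Numeric columns such as id and fev are untouched, because across() only applies as_factor to the columns that where(is.character) selects.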

For more on factors, visit https://r4ds.had.co.nz/factors.html.

Converting Data Frames to Tibbles

Use as_tibble() to convert an existing data frame into a tibble, or tibble() to build a tibble directly from its columns. Note that read_rds and read_csv return tibbles automatically, so no conversion is needed after using them.
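As a brief illustration, the sketch below converts a plain base-R data frame into a tibble. The toy data frame is hypothetical, and the code assumes the tibble package is installed.

```r
library(tibble)

# A plain base-R data frame (hypothetical toy data)
df <- data.frame(id = 1:3, group = c("a", "b", "a"))
class(df)

# Convert it to a tibble
tb <- as_tibble(df)
class(tb)

# Check which objects are tibbles
is_tibble(df)
is_tibble(tb)
```

A tibble is still a data frame, so functions that expect a data frame will generally accept tb as well.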

For more on tibbles, visit https://r4ds.had.co.nz/tibbles.html.

For more advice