Appendix A — Getting Data Into R

Using data from an R package

To use data from an R package (for instance, the bechdel data from the fivethirtyeight package), load the relevant package with library(). The data frame is then available by name.
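
As a minimal sketch of that step, the commands below load the package and print the data. This assumes the fivethirtyeight package is already installed on your machine, for example through install.packages("fivethirtyeight"), and that the tidyverse is installed so the data print as a tibble.

```r
library(tidyverse)        # so that the data print in tibble form
library(fivethirtyeight)  # assumes this package is already installed

# bechdel is now available by name; typing the name prints the data
bechdel
```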

# A tibble: 1,794 × 15
    year imdb      title  test  clean_test binary budget domgross intgross code 
   <int> <chr>     <chr>  <chr> <ord>      <chr>   <int>    <dbl>    <dbl> <chr>
 1  2013 tt1711425 21 & … nota… notalk     FAIL   1.3 e7 25682380   4.22e7 2013…
 2  2012 tt1343727 Dredd… ok-d… ok         PASS   4.50e7 13414714   4.09e7 2012…
 3  2013 tt2024544 12 Ye… nota… notalk     FAIL   2   e7 53107035   1.59e8 2013…
 4  2013 tt1272878 2 Guns nota… notalk     FAIL   6.1 e7 75612460   1.32e8 2013…
 5  2013 tt0453562 42     men   men        FAIL   4   e7 95020213   9.50e7 2013…
 6  2013 tt1335975 47 Ro… men   men        FAIL   2.25e8 38362475   1.46e8 2013…
 7  2013 tt1606378 A Goo… nota… notalk     FAIL   9.2 e7 67349198   3.04e8 2013…
 8  2013 tt2194499 About… ok-d… ok         PASS   1.20e7 15323921   8.73e7 2013…
 9  2013 tt1814621 Admis… ok    ok         PASS   1.3 e7 18007317   1.80e7 2013…
10  2013 tt1815862 After… nota… notalk     FAIL   1.3 e8 60522097   2.44e8 2013…
# ℹ 1,784 more rows
# ℹ 5 more variables: budget_2013 <int>, domgross_2013 <dbl>,
#   intgross_2013 <dbl>, period_code <int>, decade_code <int>

Using read_rds to read in an R data set

We have provided the nnyfs.Rds data file on the course data page.

Suppose you have downloaded this data file into a directory on your computer called data, which is a subdirectory of the directory where you plan to do your work, perhaps called 431-nnyfs.

Open RStudio and create a new project in the 431-nnyfs directory on your computer. You should see the data subdirectory in the Files window in RStudio after the project is created.

Now, read in the nnyfs.Rds file to a new tibble in R called nnyfs_new with the following command:

library(tidyverse)  # loads readr, which provides read_rds()
nnyfs_new <- read_rds("data/nnyfs.Rds")

Here are the results…

nnyfs_new
# A tibble: 1,518 × 45
    SEQN sex    age_child race_eth    educ_child language sampling_wt income_pov
   <dbl> <fct>      <dbl> <fct>            <dbl> <fct>          <dbl>      <dbl>
 1 71917 Female        15 3_Black No…          9 English       28299.       0.21
 2 71918 Female         8 3_Black No…          2 English       15127.       5   
 3 71919 Female        14 2_White No…          8 English       29977.       5   
 4 71920 Female        15 2_White No…          8 English       80652.       0.87
 5 71921 Male           3 2_White No…         NA English       55592.       4.34
 6 71922 Male          12 1_Hispanic           6 English       27365.       5   
 7 71923 Male          12 2_White No…          5 English       86673.       5   
 8 71924 Female         8 4_Other Ra…          2 English       39549.       2.74
 9 71925 Male           7 1_Hispanic           0 English       42333.       0.46
10 71926 Male           8 3_Black No…          2 English       15307.       1.57
# ℹ 1,508 more rows
# ℹ 37 more variables: age_adult <dbl>, educ_adult <fct>, respondent <fct>,
#   salt_used <fct>, energy <dbl>, protein <dbl>, sugar <dbl>, fat <dbl>,
#   diet_yesterday <fct>, water <dbl>, plank_time <dbl>, height <dbl>,
#   weight <dbl>, bmi <dbl>, bmi_cat <fct>, arm_length <dbl>, waist <dbl>,
#   arm_circ <dbl>, calf_circ <dbl>, calf_skinfold <dbl>,
#   triceps_skinfold <dbl>, subscapular_skinfold <dbl>, active_days <dbl>, …
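
Note that the factor columns in this listing arrive as factors, because an Rds file stores an R object as it was saved. The sketch below illustrates that round trip with a small made-up tibble written to a temporary file. The toy data and object names are assumptions for illustration only, not part of the nnyfs data.

```r
library(readr)
library(tibble)

# A small, made-up tibble with one factor column (hypothetical data)
toy <- tibble(
  id  = c(1, 2, 3),
  sex = factor(c("Female", "Male", "Female"))
)

# Write the tibble to a temporary .Rds file, then read it back in
rds_path <- tempfile(fileext = ".Rds")
write_rds(toy, rds_path)
toy_back <- read_rds(rds_path)

# The factor column keeps its type and its levels
class(toy_back$sex)
levels(toy_back$sex)
```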

Using read_csv to read in a comma-separated version of a data file

We have provided the nnyfs.csv data file on the course data page.

Suppose you have downloaded this data file into a directory on your computer called data, which is a subdirectory of the directory where you plan to do your work, perhaps called 431-nnyfs.

Open RStudio and create a new project in the 431-nnyfs directory on your computer. You should see the data subdirectory in the Files window in RStudio after the project is created.

Now, read in the nnyfs.csv file to a new tibble in R called nnyfs_new2 with the following command:

nnyfs_new2 <- read_csv("data/nnyfs.csv")
Rows: 1518 Columns: 45
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): sex, race_eth, language, educ_adult, respondent, salt_used, diet_y...
dbl (27): SEQN, age_child, educ_child, sampling_wt, income_pov, age_adult, e...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nnyfs_new2
# A tibble: 1,518 × 45
    SEQN sex    age_child race_eth    educ_child language sampling_wt income_pov
   <dbl> <chr>      <dbl> <chr>            <dbl> <chr>          <dbl>      <dbl>
 1 71917 Female        15 3_Black No…          9 English       28299.       0.21
 2 71918 Female         8 3_Black No…          2 English       15127.       5   
 3 71919 Female        14 2_White No…          8 English       29977.       5   
 4 71920 Female        15 2_White No…          8 English       80652.       0.87
 5 71921 Male           3 2_White No…         NA English       55592.       4.34
 6 71922 Male          12 1_Hispanic           6 English       27365.       5   
 7 71923 Male          12 2_White No…          5 English       86673.       5   
 8 71924 Female         8 4_Other Ra…          2 English       39549.       2.74
 9 71925 Male           7 1_Hispanic           0 English       42333.       0.46
10 71926 Male           8 3_Black No…          2 English       15307.       1.57
# ℹ 1,508 more rows
# ℹ 37 more variables: age_adult <dbl>, educ_adult <chr>, respondent <chr>,
#   salt_used <chr>, energy <dbl>, protein <dbl>, sugar <dbl>, fat <dbl>,
#   diet_yesterday <chr>, water <dbl>, plank_time <dbl>, height <dbl>,
#   weight <dbl>, bmi <dbl>, bmi_cat <chr>, arm_length <dbl>, waist <dbl>,
#   arm_circ <dbl>, calf_circ <dbl>, calf_skinfold <dbl>,
#   triceps_skinfold <dbl>, subscapular_skinfold <dbl>, active_days <dbl>, …

If you also want to convert the character variables to factors, as you will often want to do before analyzing the data, use this instead:

nnyfs_new3 <- read_csv("data/nnyfs.csv") %>%
    mutate(across(where(is.character), as_factor))
Rows: 1518 Columns: 45
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): sex, race_eth, language, educ_adult, respondent, salt_used, diet_y...
dbl (27): SEQN, age_child, educ_child, sampling_wt, income_pov, age_adult, e...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nnyfs_new3
# A tibble: 1,518 × 45
    SEQN sex    age_child race_eth    educ_child language sampling_wt income_pov
   <dbl> <fct>      <dbl> <fct>            <dbl> <fct>          <dbl>      <dbl>
 1 71917 Female        15 3_Black No…          9 English       28299.       0.21
 2 71918 Female         8 3_Black No…          2 English       15127.       5   
 3 71919 Female        14 2_White No…          8 English       29977.       5   
 4 71920 Female        15 2_White No…          8 English       80652.       0.87
 5 71921 Male           3 2_White No…         NA English       55592.       4.34
 6 71922 Male          12 1_Hispanic           6 English       27365.       5   
 7 71923 Male          12 2_White No…          5 English       86673.       5   
 8 71924 Female         8 4_Other Ra…          2 English       39549.       2.74
 9 71925 Male           7 1_Hispanic           0 English       42333.       0.46
10 71926 Male           8 3_Black No…          2 English       15307.       1.57
# ℹ 1,508 more rows
# ℹ 37 more variables: age_adult <dbl>, educ_adult <fct>, respondent <fct>,
#   salt_used <fct>, energy <dbl>, protein <dbl>, sugar <dbl>, fat <dbl>,
#   diet_yesterday <fct>, water <dbl>, plank_time <dbl>, height <dbl>,
#   weight <dbl>, bmi <dbl>, bmi_cat <fct>, arm_length <dbl>, waist <dbl>,
#   arm_circ <dbl>, calf_circ <dbl>, calf_skinfold <dbl>,
#   triceps_skinfold <dbl>, subscapular_skinfold <dbl>, active_days <dbl>, …

Note that, for example, sex and race_eth are now listed as factor (fct) variables. One place where this distinction between character and factor variables matters is when you summarize the data.

summary(nnyfs_new2$race_eth)
   Length     Class      Mode 
     1518 character character 

summary(nnyfs_new3$race_eth)
  3_Black Non-Hispanic   2_White Non-Hispanic             1_Hispanic 
                   338                    610                    450 
4_Other Race/Ethnicity 
                   120 
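
The message that read_csv prints in the listings above suggests two options: specifying the column types yourself, or setting show_col_types = FALSE. The sketch below tries both on a tiny made-up CSV file written to a temporary location. The file, its contents, and the object names are assumptions for illustration, not the nnyfs data.

```r
library(readr)

# Write a tiny, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,sex,age", "1,Female,15", "2,Male,8", "3,Female,14"), csv_path)

# Option 1: let read_csv guess the column types, but silence the message
quiet <- read_csv(csv_path, show_col_types = FALSE)

# Option 2: state the column types up front, reading sex as a factor
typed <- read_csv(
  csv_path,
  col_types = cols(
    id  = col_double(),
    sex = col_factor(),
    age = col_double()
  )
)

class(quiet$sex)  # guessed as character
class(typed$sex)  # read directly as a factor
```

Stating the types in the call is more typing, but it records in your code what you expect each column to hold.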

Converting Character Variables into Factors

To create newdata from olddata, with every character variable converted to a factor, use:

newdata <- olddata %>%
    mutate(across(where(is.character), as_factor))
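
To see what this does, here is a small sketch on made-up data. The olddata tibble below is hypothetical and stands in for whatever data frame you are converting. It also compares as_factor() from the forcats package with base R's factor(), since the two order the levels differently.

```r
library(dplyr)
library(forcats)
library(tibble)

# A small, made-up tibble with one character column (hypothetical data)
olddata <- tibble(
  id    = 1:4,
  group = c("low", "high", "medium", "high")
)

# Convert every character column to a factor; other columns are untouched
newdata <- olddata %>%
  mutate(across(where(is.character), as_factor))

# as_factor() orders the levels by first appearance in the data
levels(newdata$group)

# base R's factor() sorts the levels alphabetically instead
levels(factor(olddata$group))
```

This ordering by first appearance appears to be why the summary of race_eth shown earlier lists 3_Black Non-Hispanic first: that is the value in the first row of the data.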

For more on factors, visit https://r4ds.had.co.nz/factors.html

Converting Data Frames to Tibbles

Use as_tibble() or simply tibble() to convert a data frame to a tibble. Note that read_csv always returns a tibble, while read_rds returns whatever object was saved in the file (a tibble, in the case of nnyfs.Rds).
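
As a brief sketch, the code below converts mtcars, a plain data frame that ships with R, into a tibble. Tibbles do not keep row names, so the rownames argument is used here to store them in a regular column. The column name model and the object name cars_tbl are our own choices for illustration.

```r
library(tibble)

# mtcars is a plain data frame included with R
class(mtcars)

# Convert it to a tibble, keeping the row names as a column called "model"
cars_tbl <- as_tibble(mtcars, rownames = "model")

class(cars_tbl)
cars_tbl
```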

For more on tibbles, visit https://r4ds.had.co.nz/tibbles.html.

For more advice

Consider visiting the software tutorials page under the R and Data heading on our main web site.