Discussion 11: Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

I am using the “Internet Users Country-wise 1980-2020” dataset from Kaggle to examine the relationship between the number of internet users and cellular subscriptions.

Here is the link to the dataset: https://www.kaggle.com/datasets/ashishraut64/internet-users

Content

The following dataset has information about internet users from 1980-2020. Details about the columns are as follows:

  • Entity - Contains the name of the countries and the regions.
  • Code - Country code. Where the code has the value ‘Region’, the row is a grouping of several countries rather than a single country.
  • Year - Year from 1980-2020
  • Cellular Subscription - Mobile phone subscriptions per 100 people. This number can get over 100 when the average person has more than one subscription to a mobile service.
  • Internet Users(%) - The share of the population that is accessing the internet for all countries of the world.
  • No. of Internet Users - Number of people using the Internet in every country.
  • Broadband Subscription - The number of fixed broadband subscriptions per 100 people. This refers to fixed subscriptions to high-speed access to the public Internet (a TCP/IP connection), at downstream speeds equal to, or greater than, 256 kbit/s.

Load dataset

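library(tidyverse)  # provides read_csv(), rename(), %>% and ggplot()
library(jtools)     # provides summ(), used later for the tabular model summary

df<-read_csv("https://raw.githubusercontent.com/deepasharma06/Data-605/main/Internet%20user_data.csv")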
## New names:
## Rows: 8867 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): Entity, Code dbl (6): ...1, Year, Cellular Subscription, Internet
## Users(%), No. of Intern...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
head(df)
## # A tibble: 6 × 8
##    ...1 Entity      Code   Year `Cellular Subscription` Intern…¹ No. o…² Broad…³
##   <dbl> <chr>       <chr> <dbl>                   <dbl>    <dbl>   <dbl>   <dbl>
## 1     0 Afghanistan AFG    1980                       0        0       0       0
## 2     1 Afghanistan AFG    1981                       0        0       0       0
## 3     2 Afghanistan AFG    1982                       0        0       0       0
## 4     3 Afghanistan AFG    1983                       0        0       0       0
## 5     4 Afghanistan AFG    1984                       0        0       0       0
## 6     5 Afghanistan AFG    1985                       0        0       0       0
## # … with abbreviated variable names ¹​`Internet Users(%)`,
## #   ²​`No. of Internet Users`, ³​`Broadband Subscription`
summary(df)
##       ...1         Entity              Code                Year     
##  Min.   :   0   Length:8867        Length:8867        Min.   :1980  
##  1st Qu.:2216   Class :character   Class :character   1st Qu.:1990  
##  Median :4433   Mode  :character   Mode  :character   Median :2000  
##  Mean   :4433                                         Mean   :2000  
##  3rd Qu.:6650                                         3rd Qu.:2010  
##  Max.   :8866                                         Max.   :2020  
##  Cellular Subscription Internet Users(%)  No. of Internet Users
##  Min.   :  0.000       Min.   :  0.0000   Min.   :0.000e+00    
##  1st Qu.:  0.000       1st Qu.:  0.0000   1st Qu.:0.000e+00    
##  Median :  5.501       Median :  0.8557   Median :1.005e+04    
##  Mean   : 39.990       Mean   : 17.0436   Mean   :1.089e+07    
##  3rd Qu.: 82.232       3rd Qu.: 25.4499   3rd Qu.:8.664e+05    
##  Max.   :436.103       Max.   :100.0000   Max.   :4.700e+09    
##  Broadband Subscription
##  Min.   : 0.000        
##  1st Qu.: 0.000        
##  Median : 0.000        
##  Mean   : 4.441        
##  3rd Qu.: 2.008        
##  Max.   :78.524

Change column names

df <- df %>%
  rename("CellularSubscription" = "Cellular Subscription",
         "NoofInternetUsers" = "No. of Internet Users")

Check df to see the new column names

head(df)
## # A tibble: 6 × 8
##    ...1 Entity      Code   Year CellularSubscription Internet …¹ NoofI…² Broad…³
##   <dbl> <chr>       <chr> <dbl>                <dbl>       <dbl>   <dbl>   <dbl>
## 1     0 Afghanistan AFG    1980                    0           0       0       0
## 2     1 Afghanistan AFG    1981                    0           0       0       0
## 3     2 Afghanistan AFG    1982                    0           0       0       0
## 4     3 Afghanistan AFG    1983                    0           0       0       0
## 5     4 Afghanistan AFG    1984                    0           0       0       0
## 6     5 Afghanistan AFG    1985                    0           0       0       0
## # … with abbreviated variable names ¹​`Internet Users(%)`, ²​NoofInternetUsers,
## #   ³​`Broadband Subscription`

Data Preparation

First, I will run the code below to check whether there are any missing values in the data.

print(colSums(is.na(df)))
##                   ...1                 Entity                   Code 
##                      0                      0                      0 
##                   Year   CellularSubscription      Internet Users(%) 
##                      0                      0                      0 
##      NoofInternetUsers Broadband Subscription 
##                      0                      0

We can see that there are no missing values in the data, so no further cleanup is needed.
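
A count of zero NA values does not by itself rule out placeholder values: in the summary above, the first quartile of every measured column is 0. Whether those zeros are true zeros or stand-ins for unrecorded years is not stated in the dataset description. A minimal sketch of how the zeros could be counted is shown below. It uses a small made-up data frame (toy is hypothetical and its values are invented for illustration); the same calls could be applied to df.

```r
# Hypothetical toy data standing in for df (values are made up for illustration)
toy <- data.frame(
  Entity = c("A", "A", "B", "B"),
  CellularSubscription = c(0, 12.5, 0, 80.1),
  NoofInternetUsers = c(0, 1500, 0, 250000)
)

# Keep only the numeric columns, then count exact zeros in each
numeric_cols <- toy[sapply(toy, is.numeric)]
zero_counts <- colSums(numeric_cols == 0)
print(zero_counts)

# Share of rows where both variables of interest are zero
both_zero <- mean(toy$CellularSubscription == 0 & toy$NoofInternetUsers == 0)
print(both_zero)
```

A large share of rows that are zero in both variables would pull the fitted line toward the origin and is worth knowing about before interpreting the model.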

Scatter plot of Cellular Subscription vs No. of Internet Users

# Axes match the model fitted below: response on y, predictor on x
df %>%
  ggplot(aes(x=NoofInternetUsers, y=CellularSubscription)) +
  geom_point()+
  geom_smooth(method='lm',na.rm=TRUE) +
  labs(x='No of Internet Users',y='Cellular Subscription',title='Cellular Subscription vs No of Internet Users')
## `geom_smooth()` using formula = 'y ~ x'

Simple Linear Model

lm <- lm(CellularSubscription ~NoofInternetUsers,data=df)
summary(lm)
## 
## Call:
## lm(formula = CellularSubscription ~ NoofInternetUsers, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -102.06  -39.60  -34.27   41.54  396.48 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.960e+01  5.521e-01  71.725  < 2e-16 ***
## NoofInternetUsers 3.588e-08  4.404e-09   8.146 4.28e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.79 on 8865 degrees of freedom
## Multiple R-squared:  0.007429,   Adjusted R-squared:  0.007317 
## F-statistic: 66.35 on 1 and 8865 DF,  p-value: 4.284e-16

To view the model statistics in tabular form, use summ() from the jtools package:

summ(lm)
Observations          8867
Dependent variable    CellularSubscription
Type                  OLS linear regression

F(1,8865)    66.35
R²           0.01
Adj. R²      0.01

                     Est.   S.E.   t val.      p
(Intercept)         39.60   0.55    71.73   0.00
NoofInternetUsers    0.00   0.00     8.15   0.00

Standard errors: OLS
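
The slope is hard to read because the predictor is a raw head count, and the rounded table shows it as 0.00. The short sketch below rescales the values reported by summary(lm) into more readable units. The variable names are my own, and the two numbers are copied from the output above rather than recomputed.

```r
# Values copied from the summary(lm) output above
slope <- 3.588e-08      # change in subscriptions per 100 people per additional internet user
r_squared <- 0.007429   # Multiple R-squared

# Rescale the slope to a more readable unit: per 100 million additional users
slope_per_100m <- slope * 1e8
print(slope_per_100m)

# Percentage of variance in CellularSubscription explained by the model
pct_explained <- r_squared * 100
print(pct_explained)
```

Read this way, 100 million additional internet users are associated with roughly 3.6 more subscriptions per 100 people, and the model accounts for about 0.74% of the variance in CellularSubscription.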

Normal Q-Q plot to check the normality of the residuals

# define residuals
res <- resid(lm)

#create Q-Q plot for residuals
qqnorm(res)

#add a straight diagonal line to the plot
qqline(res, col = "red")

Alternatively, the same Q-Q plot can be drawn directly from the model object:
#qqnorm(lm$residuals)
#qqline(lm$residuals)
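
A Q-Q plot is judged by eye, so it can help to put a number next to it. The sketch below computes two simple summaries of the residuals: their skewness and the share of standardized residuals beyond 2 in absolute value. It is an illustration only and is fitted on R's built-in cars dataset to stay self-contained; the names fit, skewness and share_beyond_2 are my own, and the same steps could be applied to the model above.

```r
# Built-in example data, used only to keep the sketch self-contained
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Sample skewness: near 0 for symmetric residuals, positive for a long right tail
skewness <- mean((res - mean(res))^3) / mean((res - mean(res))^2)^1.5

# Share of standardized residuals beyond +/- 2; roughly 5% is expected under normality
std_res <- rstandard(fit)
share_beyond_2 <- mean(abs(std_res) > 2)

print(c(skewness = skewness, share_beyond_2 = share_beyond_2))
```

A skewness far from 0, or a share well above 5%, points in the same direction as a Q-Q plot whose points bend away from the reference line.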

Detailed diagnostic plots to assess the model

par(mfrow=c(2,2))
plot(lm)

par(mfrow=c(1,1))

Conclusion:

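Overall, simple linear regression is not appropriate for modeling cellular subscriptions from the number of internet users. Although the slope is statistically significant (p = 4.28e-16), the model explains less than 1% of the variance (R-squared = 0.0074). The residuals are also strongly right-skewed (median -34.27, maximum 396.48) rather than centered on zero, which is inconsistent with the normality assumption. In addition, the two variables are on different scales: CellularSubscription is a rate per 100 people, while NoofInternetUsers is an absolute count that depends on population size. The relationship between them is therefore not well described by a straight line.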